• Open

    AI generated music video
    submitted by /u/aquin1313 [link] [comments]  ( 40 min )
    What if Patrick Bateman was a Data Scientist? An AI-Generated Video
    submitted by /u/SupPandaHugger [link] [comments]  ( 40 min )

  • Open

    InstructPix2Pix Video: One Vase, Many Artists
    submitted by /u/anitakirkovska [link] [comments]  ( 40 min )
    Make-a-Video3D: Meta generates 3D scenes from text
    submitted by /u/henlo_there_fren [link] [comments]  ( 40 min )
    AI Characters Become Self-Aware Ft. MrBeast - (Funny Moments)
    submitted by /u/SnowDustHD [link] [comments]  ( 40 min )
    AI Dream 150 - INTERSTELLAR PORTAL Part4 TEASER - AI Video vqgan clip
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    AI will also help you find a job: job offers related to this area grow by 31%
    submitted by /u/nikko_fan [link] [comments]  ( 40 min )
    What are some AI models out there that are not owned by large businesses that require my personal info to use?
    submitted by /u/temporaryAMA [link] [comments]  ( 40 min )
    Explain Cinderella as if you are Donald Trump
    submitted by /u/Imagine-your-success [link] [comments]  ( 41 min )
    Should I be learning data analysis now if my goal to become an AI engineer/researcher?
    Hello, I'm currently self-studying math with the goal of learning up to liner-algebra and then returning to school for a BS in mathematics followed by a graduate and then post-graduate degree in a subjecting relating to AI research. Given this goal I was wondering if I should also be self-studying data analysis along with the math. I reason that this might be good since I'd be getting hands-on experience with data and I might even be able to do it as a job while I study. In my math studies I'm currently on Algebra 2 so if I should be learning data analysis, can I start learning it now or would it be better to start after I've learned more advanced math like calculus? submitted by /u/multioutletplug [link] [comments]  ( 42 min )
    Hello guys. Could you please share experience with learning and working with Neural networks? I have already done with some of the tutorials, but want to know which language is best suitable to create AI (Python, Java).
    submitted by /u/Your_mag [link] [comments]  ( 42 min )
    Google's MusicLM is Astoundingly Good at Making AI-Generated Music, But They're Not Releasing it Due to Copyright Concerns
    submitted by /u/HODLTID [link] [comments]  ( 45 min )
    Research
    Hello everyone, I am currently doing a college research on artificial intelligence and ethics, looking to specify in the field of Bias and fairness in AI. I am looking for someone with some knowledge in the field to conduct a brief interview with in which I ask a couple of questions related to the topic. If you are interested, kindly @ me. Thanks :) submitted by /u/AyanT_01 [link] [comments]  ( 40 min )
    ChatGPT is not all that good
    I get inaccurate information form ChatGPT when I ask questions in areas where I have a good bit of knowledge. I couple of times it has cited specific books or papers by title and author and, as far as I can tell, the books/papers don't exists. The second time it did this, I ask it why it makes this kind of odd mistakes and it said: "As a language model, my information is based on the text that I have been trained on, and I try to provide the most accurate and relevant information based on that data. However, I am not infallible and sometimes make mistakes or provide information that is inaccurate or out-of-date. I apologize for any inaccuracies in my previous responses, and for any confusion that they may have caused. It's important to know that the information provided by AI models like me should be taken as a starting point for further research and not as the final answer. It's always a good idea to consult with experts or verify information with multiple sources. Again, I apologize for any inaccuracies in my previous statements and any confusion they may have caused." But it is still weird that it would make up specific books and papers, I don't see why a language model would do that. Edit: Maybe the text that it is trained on references specific books and paper that don't exist. And I guess that would be text that are not indexed by Google and other places where I am doing my fact checking such that I cannot figure out where it is getting this stuff. submitted by /u/facinabush [link] [comments]  ( 46 min )
    Made a website to teach you how to make passive/active income using AI!
    I recently created a website called https://cashwithai.com that is dedicated to helping people learn how to make money using AI like ChatGPT. The website offers a variety of resources, including a QuickStart guide, case studies, and tips and tricks for monetizing AI-generated content. Additionally, I'm offering free 1-on-1 consultations to anyone who is looking for personalized advice and guidance on how to make money with AI. I'm not running ads or charging; I run purely off donations. Let me know if you have any questions! submitted by /u/Chadcash [link] [comments]  ( 41 min )
    New Incredible Text-To-Music Generation Model By Google
    submitted by /u/SupPandaHugger [link] [comments]  ( 40 min )
    What is the information content in a logic statement?
    I recently had a lot of fun working with my talented peers at IBM Research putting together a point of view to this fascinating question. Wanted to share with you our work and to get feedback from you. ibm.biz/logic_and_information submitted by /u/Consistent_Listen127 [link] [comments]  ( 40 min )
    AI will not replace marketers, but marketers who use AI will replace those who don’t.
    AI will not replace marketers, but marketers who use AI will replace those who don’t. submitted by /u/TheVellerShow [link] [comments]  ( 40 min )
    OpenAI Hired Developers to Train its AI to Replace Developers
    submitted by /u/lambolifeofficial [link] [comments]  ( 40 min )
    META presents MAV3D — text to 3D video
    submitted by /u/SpatialComputing [link] [comments]  ( 24 min )
    image to voice ai stable image to voice
    submitted by /u/VNKT-FOREVER [link] [comments]  ( 40 min )
    Real world applications of AI
    There has been many advancements in especially in CV, NLP etc. But, I don't see many new AI models that are solving real world problems. There has been a lot of advancements in AI in RL, DL etc but I rarely see new applications in real world. All i see are text to image models, advanced chat bots, better game playing AIs etc. These are definitely amazing, but, I wanna see new stuff which is making a good impact on the real world. Alpha-fold, driverless cars etc are sorta things I am looking for. I don't know if I am just bluntly unaware of new stuff in AI which solves practical problems, or whether not much new stuff is happening in those areas. So, I would be glad if I can know more about how AI is being used in new ways to solve real-wprld problems, or any new AI research trying to tackle a real world problem? Sry if it's a stupid question, and sry for spamming "real world" too much. submitted by /u/Unhappy_Version7565 [link] [comments]  ( 44 min )
    Want to catch up on the breakthroughs in AI the last 10 years. What should I read?
    Hi guys, I have a technical background and have studied some AI in the past. I'd like to catch up on the latest developments in AI over the last ~10 years. Do you have any recommendations on what to read? I was thinking of maybe trying to get the top most cited papers in the last 10 years and reading them. But I'm not sure where would be the best place to find that. Any suggestions? Thanks in advance. submitted by /u/AlexWD [link] [comments]  ( 41 min )
    spooky season 💅
    submitted by /u/Rich_Dragon17 [link] [comments]  ( 40 min )
    BuzzFeed Plans to 'Hire' ChatGPT as Newest Content Creator
    submitted by /u/lambolifeofficial [link] [comments]  ( 40 min )
  • Open

    [P] tiny-diffusion: a minimal PyTorch implementation of probabilistic diffusion models for 2D datasets
    submitted by /u/tanelai [link] [comments]  ( 43 min )
    [D] Running a small Bloom checkpoint on a mini pc.
    Has anyone been able to run a small LLM checkpoint in commodity hardware, like a mini PC? If so, what were your specs? submitted by /u/onedjscream [link] [comments]  ( 42 min )
    [D] could multiple-input transformers reduce the pain of the training data acquisition problem?
    so it's big pop-sci news that we are running out of quality textual training data (soft-paywalled article, but you get the idea) to produce chinchilla-optimal language models, and they appear to continue learning new abilities as data and parameter size increase. when an infant learns what a cat is, it is not only described, but the infant can see it and understand its form and behavior in a way it can then go on to describe and extrapolate from (even if they are blind, they can touch it and understand its shape and feel its fur). LLMs have to do this the hard way: their generalized understanding of the shape and behaviour of a cat comes from textual descriptions of them (and they would need quite a lot in order to understand!) most of the research i have seen into multiple input transformer models has been with the purpose of task completion (google's embodied language model robot butlers etc, which often use textual descriptions fed to a normal LLM, see https://innermonologue.github.io/ ) or image recognition and understanding (such as in CLIP) but not necessarily applying it to textual completion, which seems like it could benefit from a more visual understanding of the world so, in the medium or short term, to improve performance on text-completion tasks, what are your thoughts on using image training as well as textual to improve generalization for LLMs with fewer text tokens on a new architecture? (also, please excuse any ignorance i may posess: i'm a bit of an armchair ai enjoyer) submitted by /u/Dankmemexplorer [link] [comments]  ( 44 min )
    GAN Training Gradient questions [D]
    The main reason why this is not in the simple questions thread is the need to includee an image. Here is an image of my generator's and discriminator's gradients being logged onto wandb. As you can see, they have these weird hasms, where the distribution of the gradients becomes very close to zero. These chasms are correlated for the discriminator and generator and seem very regular. Anyone experienced anything like this and maybe has a hunch of what might be the cause? https://preview.redd.it/3qipyez6stea1.png?width=624&format=png&auto=webp&s=f5fb39803082add2fd78979e880e6e784b9e4c0c submitted by /u/Hub_Pli [link] [comments]  ( 43 min )
    [D] Goodness of fit question
    For regressions, R-squared and Adj. R-Squared are typically considered the primary goodness-of-fit measures. But in many supervised machine-learning models, RMSE is the main measure that I keep running across. For example, decision tree models that I create in R using Rpart do that. So, my question is how to compare the predictive accuracy of OLS regression models that report R-sq to equivalent Rpart regression trees that report RMSE. submitted by /u/bridgeton_man [link] [comments]  ( 43 min )
    [N] OpenAI has 1000s of contractors to fine-tune codex
    submitted by /u/yazriel0 [link] [comments]  ( 45 min )
    [P] GPT JupyterLab - JupyterLab extension to use OpenAI’s GPT models for code and text completion on your notebook cells.
    Hi everyone, I made a JupyterLab extension to use OpenAI’s GPT models for code and text completion on your notebook cells. This extension passes your current notebook cell to the GPT API and completes your code/text for you. You can customize the GPT parameters in the Advanced Settings menu. I made this extension when I couldn't find any Copilot/Codex extensions for JupyterLab. It doesn't make sense that ML folks don't have an easy way to use AI generated code in their own tools. VS Code does allow you use Copilot, but I've gotten used to Jupyter and a lot of ML/DS folks I know still prefer using Jupyter over VS code. Installation pip install gpt_jupyterlab GitHub Repo: https://github.com/henshinger/gpt-jupyterlab/ Demo GPT JupyterLab Demo Note: You will need your own OpenAI API Key to use this extension. Would love to get your feedback! submitted by /u/henshinger [link] [comments]  ( 44 min )
    [P] Launching my first ever open-source project and it might make your ChatGPT answers better
    I am building an open-source ML observability and refinement toolkit which recently got investment from YCombinator. The tool helps ML practitioners to: 1. Understand how their models are performing in production 2. Catch edge-cases and outliers to help them refine their models 3. Allow them to customise the tool according to their needs (hence, open-source) 4. Bring data-security at the forefront (hence, self hosted) You can check out the project https://github.com/uptrain-ai/uptrain and would love to hear feedback from the community submitted by /u/Vegetable-Skill-9700 [link] [comments]  ( 43 min )
    [P] Parse research papers into structured data
    ​ https://preview.redd.it/1t7spoqxdsea1.png?width=1920&format=png&auto=webp&s=b4643e418d942260b16019ee250edf56a4336b4b paperai is a semantic search and workflow application for medical/scientific papers. It can be used to take a set of research papers, parse the content and turn it into structured data. paperetl is the underlying library that parses basic metadata such as title, publication and date out of the papers. https://preview.redd.it/5jywynarfsea1.png?width=1084&format=png&auto=webp&s=437b6cdb65fbf12d6d238f84fd94ec4b85dec93b In addition to standard metadata, paperai can also run extractive queries to build additional columns. https://preview.redd.it/8rzfj016esea1.png?width=1138&format=png&auto=webp&s=ccd73c34f0f3001dfe05a0f8c480cf73451dd447 Example notebooks and Docker files can be found on GitHub. paperai | paperetl submitted by /u/davidmezzetti [link] [comments]  ( 43 min )
    [P] Annotating text with sparse human annotations and different length chunks
    I want to automate the annotation of a domain-specific text (complicated contracts) by finetuning a BERT model. However the annotated text I've been provided by the domain experts has been sparsely annotated (i.e. paragraph 40-48 has been fully annotated, while 15 other paragraphs only have certain classes annotated for certain words. Most paragraphs have nothing annotated (like 70% of the entire corpus) Another complication is that for 1 class, the entire paragraph should be annotated in this class, while for others it's a single word or a sentence. There are 7 classes in total and in the end, all tokens should be annotated to one of the 7 classes. 6 out of 7 classes are also pretty domain-specific and not something like 'location' or 'person' or POS. I've been thinking about using an annotation framework (i.e. LabelStudio or Prodigy) that supports custom models (i.e. finetuned BERT) and active learning to rapidly increase annotated texts by speeding up the work of human annotators (which are domain experts and usually don't have a lot of time for this). However, it's pretty unclear whether my use case would be support by this, especially with the issue of sparse text. I've also considered making the problem easier by using another model for the specific class that applies to an entire paragraph. And/or by using the fully annotated paragraphs for finetuning and using the sparse paragraphs for validation. A final consideration is using GPT-3, but I'm not sure how/if it is able to classify entire sentences/paragraphs with multiple classes and how the prompt should be formatted as. Any suggestions/ideas? submitted by /u/Background_Claim7907 [link] [comments]  ( 44 min )
    [P] Framework agnostic python package for running RWKV, RNN based models.
    https://pypi.org/project/rwkvstic/ Currently supports tensorflow, pytorch, jax Also has support for tensor streaming, 8bit jit-quant and multi-gpu. Run RWKV 7B on 8GB of vram or 14B on 16GB of vram. submitted by /u/hazardous1222 [link] [comments]  ( 42 min )
    [R] META presents MAV3D — text to 3D video
    submitted by /u/SpatialComputing [link] [comments]  ( 45 min )
    [D] Could forward-forward learning enable training large models with distributed computing?
    One problem with distributed learning with backprop is that the first layer can't update their weights until the computation has travelled all the way down to the last layer and then backpropagated back up. If all your layers are on different machines connected by a high-latency internet connection, this will take a long time. In forward-forward learning, learning is local - each layer operates independently and only needs to communicate with the layers above and below it. The results are almost-but-not-quite as good as backprop. But each layer can immediately update their weights based only on the information they received from the previous layer. Network latency no longer matters; the limit is just the bandwidth of the slowest machine. submitted by /u/currentscurrents [link] [comments]  ( 46 min )
  • Open

    Laptop Recommendations for RL
    I am looking to buy a laptop for my rl projects and I wanted to know what people in the industry recommended for training models locally and how significant OS, CPU and GPUs really are. submitted by /u/PleasantBase6967 [link] [comments]  ( 43 min )
    The value of RL feedback on language models: "[Character.ai] engagement rose by more than 30 percent." --Noam Shazeer
    submitted by /u/gwern [link] [comments]  ( 40 min )
  • Open

    Optimal machine to run 350M language model on?
    Trying to set up a discord bot. submitted by /u/Ah_Books [link] [comments]  ( 40 min )
  • Open

    Don't overfit the history -- Recursive time series data augmentation. (arXiv:2207.02891v2 [cs.LG] UPDATED)
    Time series observations can be seen as realizations of an underlying dynamical system governed by rules that we typically do not know. For time series learning tasks, we need to understand that we fit our model on available data, which is a unique realized history. Training on a single realization often induces severe overfitting lacking generalization. To address this issue, we introduce a general recursive framework for time series augmentation, which we call Recursive Interpolation Method, denoted as RIM. New samples are generated using a recursive interpolation function of all previous values in such a way that the enhanced samples preserve the original inherent time series dynamics. We perform theoretical analysis to characterize the proposed RIM and to guarantee its test performance. We apply RIM to diverse real world time series cases to achieve strong performance over non-augmented data on regression, classification, and reinforcement learning tasks.  ( 2 min )
    Interaction Decompositions for Tensor Network Regression. (arXiv:2208.06029v2 [cs.LG] UPDATED)
    It is well known that tensor network regression models operate on an exponentially large feature space, but questions remain as to how effectively they are able to utilize this space. Using a polynomial featurization, we propose the interaction decomposition as a tool that can assess the relative importance of different regressors as a function of their polynomial degree. We apply this decomposition to tensor ring and tree tensor network models trained on the MNIST and Fashion MNIST datasets, and find that up to 75% of interaction degrees are contributing meaningfully to these models. We also introduce a new type of tensor network model that is explicitly trained on only a small subset of interaction degrees, and find that these models are able to match or even outperform the full models using only a fraction of the exponential feature space. This suggests that standard tensor network models utilize their polynomial regressors in an inefficient manner, with the lower degree terms being vastly under-utilized.  ( 2 min )
    MusicLM: Generating Music From Text. (arXiv:2301.11325v1 [cs.SD])
    We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.  ( 2 min )
    f-divergences and their applications in lossy compression and bounding generalization error. (arXiv:2206.11042v3 [cs.IT] UPDATED)
    In this paper, we provide three applications for $f$-divergences: (i) we introduce Sanov's upper bound on the tail probability of the sum of independent random variables based on super-modular $f$-divergence and show that our generalized Sanov's bound strictly improves over ordinary one, (ii) we consider the lossy compression problem which studies the set of achievable rates for a given distortion and code length. We extend the rate-distortion function using mutual $f$-information and provide new and strictly better bounds on achievable rates in the finite blocklength regime using super-modular $f$-divergences, and (iii) we provide a connection between the generalization error of algorithms with bounded input/output mutual $f$-information and a generalized rate-distortion problem. This connection allows us to bound the generalization error of learning algorithms using lower bounds on the $f$-rate-distortion function. Our bound is based on a new lower bound on the rate-distortion function that (for some examples) strictly improves over previously best-known bounds.  ( 2 min )
    Overcoming the Pitfalls of Prediction Error in Operator Learning for Bilevel Planning. (arXiv:2208.07737v2 [cs.AI] UPDATED)
    Bilevel planning, in which a high-level search over an abstraction of an environment is used to guide low-level decision-making, is an effective approach to solving long-horizon tasks in continuous state and action spaces. Recent work has shown how to enable such bilevel planning by learning action and transition model abstractions in the form of symbolic operators and neural samplers. In this work, we show that existing symbolic operator learning approaches fall short in many natural environments where agent actions tend to cause a large number of irrelevant propositions to change. This is primarily because they attempt to learn operators that optimize the prediction error with respect to observed changes in the propositions. To overcome this issue, we propose to learn operators that only model changes necessary for abstract planning to achieve the specified goal. Experimentally, we show that our approach learns operators that lead to efficient planning across 10 different hybrid robotics domains, including 4 from the challenging BEHAVIOR-100 benchmark, with generalization to novel initial states, goals, and objects.  ( 2 min )
    Efficient Aggregated Kernel Tests using Incomplete $U$-statistics. (arXiv:2206.09194v3 [stat.ML] UPDATED)
    We propose a series of computationally efficient nonparametric tests for the two-sample, independence, and goodness-of-fit problems, using the Maximum Mean Discrepancy (MMD), Hilbert Schmidt Independence Criterion (HSIC), and Kernel Stein Discrepancy (KSD), respectively. Our test statistics are incomplete $U$-statistics, with a computational cost that interpolates between linear time in the number of samples, and quadratic time, as associated with classical $U$-statistic tests. The three proposed tests aggregate over several kernel bandwidths to detect departures from the null on various scales: we call the resulting tests MMDAggInc, HSICAggInc and KSDAggInc. This procedure provides a solution to the fundamental kernel selection problem as we can aggregate a large number of kernels with several bandwidths without incurring a significant loss of test power. For the test thresholds, we derive a quantile bound for wild bootstrapped incomplete $U$-statistics, which is of independent interest. We derive non-asymptotic uniform separation rates for MMDAggInc and HSICAggInc, and quantify exactly the trade-off between computational efficiency and the attainable rates: this result is novel for tests based on incomplete $U$-statistics, to our knowledge. We further show that in the quadratic-time case, the wild bootstrap incurs no penalty to test power over the more widespread permutation-based approach, since both attain the same minimax optimal rates (which in turn match the rates that use oracle quantiles). We support our claims with numerical experiments on the trade-off between computational efficiency and test power. In all three testing frameworks, the linear-time versions of our proposed tests perform at least as well as the current linear-time state-of-the-art tests.  ( 2 min )
    BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning. (arXiv:2206.08657v3 [cs.CV] UPDATED)
    Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a deep cross-modal encoder, or feed the last-layer uni-modal representations from the deep pre-trained uni-modal encoders into the top cross-modal encoder. Both approaches potentially restrict vision-language representation learning and limit model performance. In this paper, we propose Bridge-Tower, which introduces multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the cross-modal encoder. This enables effective bottom-up cross-modal alignment and fusion between visual and textual representations of different semantic levels of pre-trained uni-modal encoders in the cross-modal encoder. Pre-trained with only 4M images, Bridge-Tower achieves state-of-the-art performance on various downstream vision-language tasks. In particular, on the VQAv2 test-std set, Bridge-Tower achieves an accuracy of 78.73%, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs. Notably, when further scaling the model, Bridge-Tower achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets. Code and checkpoints are available at \url{https://github.com/microsoft/BridgeTower}.  ( 2 min )
    Ontology-enhanced Prompt-tuning for Few-shot Learning. (arXiv:2201.11332v1 [cs.CL] CROSS LISTED)
    Few-shot Learning (FSL) is aimed to make predictions based on a limited number of samples. Structured data such as knowledge graphs and ontology libraries has been leveraged to benefit the few-shot setting in various tasks. However, the priors adopted by the existing methods suffer from challenging knowledge missing, knowledge noise, and knowledge heterogeneity, which hinder the performance for few-shot learning. In this study, we explore knowledge injection for FSL with pre-trained language models and propose ontology-enhanced prompt-tuning (OntoPrompt). Specifically, we develop the ontology transformation based on the external knowledge graph to address the knowledge missing issue, which fulfills and converts structure knowledge to text. We further introduce span-sensitive knowledge injection via a visible matrix to select informative knowledge to handle the knowledge noise issue. To bridge the gap between knowledge and text, we propose a collective training algorithm to optimize representations jointly. We evaluate our proposed OntoPrompt in three tasks, including relation extraction, event extraction, and knowledge graph completion, with eight datasets. Experimental results demonstrate that our approach can obtain better few-shot performance than baselines.  ( 2 min )
    Predicting Wind-Driven Spatial Deposition through Simulated Color Images using Deep Autoencoders. (arXiv:2202.01762v3 [cs.LG] UPDATED)
    For centuries, scientists have observed nature to understand the laws that govern the physical world. The traditional process of turning observations into physical understanding is slow. Imperfect models are constructed and tested to explain relationships in data. Powerful new algorithms can enable computers to learn physics by observing images and videos. Inspired by this idea, instead of training machine learning models using physical quantities, we used images, that is, pixel information. For this work, and as a proof of concept, the physics of interest are wind-driven spatial patterns. These phenomena include features in Aeolian dunes and volcanic ash deposition, wildfire smoke, and air pollution plumes. We use computer model simulations of spatial deposition patterns to approximate images from a hypothetical imaging device whose outputs are red, green, and blue (RGB) color images with channel values ranging from 0 to 255. In this paper, we explore deep convolutional neural network-based autoencoders to exploit relationships in wind-driven spatial patterns, which commonly occur in geosciences, and reduce their dimensionality. Reducing the data dimension size with an encoder enables training deep, fully connected neural network models linking geographic and meteorological scalar input quantities to the encoded space. Once this is achieved, full spatial patterns are reconstructed using the decoder. We demonstrate this approach on images of spatial deposition from a pollution source, where the encoder compresses the dimensionality to 0.02% of the original size, and the full predictive model performance on test data achieves a normalized root mean squared error of 8%, a figure of merit in space of 94% and a precision-recall area under the curve of 0.93.  ( 3 min )
    Parsel: A (De-)compositional Framework for Algorithmic Reasoning with Language Models. (arXiv:2212.10561v2 [cs.CL] UPDATED)
    Despite recent success in large language model (LLM) reasoning, LLMs struggle with hierarchical multi-step reasoning tasks like generating complex programs. For these tasks, humans often start with a high-level algorithmic design and implement each part gradually. We introduce Parsel, a framework enabling automatic implementation and validation of complex algorithms with code LLMs, taking hierarchical function descriptions in natural language as input. We show that Parsel can be used across domains requiring hierarchical reasoning, including program synthesis, robotic planning, and theorem proving. We show that LLMs generating Parsel solve more competition-level problems in the APPS dataset, resulting in pass rates that are over 75% higher than prior results from directly sampling AlphaCode and Codex, while often using a smaller sample budget. We also find that LLM-generated robotic plans using Parsel as an intermediate language are more than twice as likely to be considered accurate than directly generated plans. Lastly, we explore how Parsel addresses LLM limitations and discuss how Parsel may be useful for human programmers.  ( 2 min )
    Extending Adversarial Attacks to Produce Adversarial Class Probability Distributions. (arXiv:2004.06383v3 [cs.LG] UPDATED)
    Despite the remarkable performance and generalization levels of deep learning models in a wide range of artificial intelligence tasks, it has been demonstrated that these models can be easily fooled by the addition of imperceptible yet malicious perturbations to natural inputs. These altered inputs are known in the literature as adversarial examples. In this paper, we propose a novel probabilistic framework to generalize and extend adversarial attacks in order to produce a desired probability distribution for the classes when we apply the attack method to a large number of inputs. This novel attack paradigm provides the adversary with greater control over the target model, thereby exposing, in a wide range of scenarios, threats against deep learning models that cannot be conducted by the conventional paradigms. We introduce four different strategies to efficiently generate such attacks, and illustrate our approach by extending multiple adversarial attack algorithms. We also experimentally validate our approach for the spoken command classification task and the Tweet emotion classification task, two exemplary machine learning problems in the audio and text domain, respectively. Our results demonstrate that we can closely approximate any probability distribution for the classes while maintaining a high fooling rate and even prevent the attacks from being detected by label-shift detection methods.  ( 2 min )
    Adaptive Gradient Methods with Local Guarantees. (arXiv:2203.01400v3 [cs.LG] UPDATED)
    Adaptive gradient methods are the method of choice for optimization in machine learning and used to train the largest deep models. In this paper we study the problem of learning a local preconditioner, that can change as the data is changing along the optimization trajectory. We propose an adaptive gradient method that has provable adaptive regret guarantees vs. the best local preconditioner. To derive this guarantee, we prove a new adaptive regret bound in online learning that improves upon previous adaptive online learning methods. We demonstrate the robustness of our method in automatically choosing the optimal learning rate schedule for popular benchmarking tasks in vision and language domains. Without the need to manually tune a learning rate schedule, our method can, in a single run, achieve comparable and stable task accuracy as a fine-tuned optimizer.  ( 2 min )
    Flowification: Everything is a Normalizing Flow. (arXiv:2205.15209v3 [cs.LG] UPDATED)
    The two key characteristics of a normalizing flow is that it is invertible (in particular, dimension preserving) and that it monitors the amount by which it changes the likelihood of data points as samples are propagated along the network. Recently, multiple generalizations of normalizing flows have been introduced that relax these two conditions. On the other hand, neural networks only perform a forward pass on the input, there is neither a notion of an inverse of a neural network nor is there one of its likelihood contribution. In this paper we argue that certain neural network architectures can be enriched with a stochastic inverse pass and that their likelihood contribution can be monitored in a way that they fall under the generalized notion of a normalizing flow mentioned above. We term this enrichment flowification. We prove that neural networks only containing linear layers, convolutional layers and invertible activations such as LeakyReLU can be flowified and evaluate them in the generative setting on image datasets.  ( 2 min )
    SoftMatch: Addressing the Quantity-Quality Trade-off in Semi-supervised Learning. (arXiv:2301.10921v1 [cs.LG])
    The critical challenge of Semi-Supervised Learning (SSL) is how to effectively leverage the limited labeled data and massive unlabeled data to improve the model's generalization performance. In this paper, we first revisit the popular pseudo-labeling methods via a unified sample weighting formulation and demonstrate the inherent quantity-quality trade-off problem of pseudo-labeling with thresholding, which may prohibit learning. To this end, we propose SoftMatch to overcome the trade-off by maintaining both high quantity and high quality of pseudo-labels during training, effectively exploiting the unlabeled data. We derive a truncated Gaussian function to weight samples based on their confidence, which can be viewed as a soft version of the confidence threshold. We further enhance the utilization of weakly-learned classes by proposing a uniform alignment approach. In experiments, SoftMatch shows substantial improvements across a wide variety of benchmarks, including image, text, and imbalanced classification.  ( 2 min )
    Explaining Visual Biases as Words by Generating Captions. (arXiv:2301.11104v1 [cs.LG])
    We aim to diagnose the potential biases in image classifiers. To this end, prior works manually labeled biased attributes or visualized biased features, which need high annotation costs or are often ambiguous to interpret. Instead, we leverage two types (generative and discriminative) of pre-trained vision-language models to describe the visual bias as a word. Specifically, we propose bias-to-text (B2T), which generates captions of the mispredicted images using a pre-trained captioning model to extract the common keywords that may describe visual biases. Then, we categorize the bias type as spurious correlation or majority bias by checking if it is specific or agnostic to the class, based on the similarity of class-wise mispredicted images and the keyword upon a pre-trained vision-language joint embedding space, e.g., CLIP. We demonstrate that the proposed simple and intuitive scheme can recover well-known gender and background biases, and discover novel ones in real-world datasets. Moreover, we utilize B2T to compare the classifiers using different architectures or training methods. Finally, we show that one can obtain debiased classifiers using the B2T bias keywords and CLIP, in both zero-shot and full-shot manners, without using any human annotation on the bias.  ( 2 min )
    On the Convergence of No-Regret Learning Dynamics in Time-Varying Games. (arXiv:2301.11241v1 [cs.LG])
    Most of the literature on learning in games has focused on the restrictive setting where the underlying repeated game does not change over time. Much less is known about the convergence of no-regret learning algorithms in dynamic multiagent settings. In this paper, we characterize the convergence of \emph{optimistic gradient descent (OGD)} in time-varying games by drawing a strong connection with \emph{dynamic regret}. Our framework yields sharp convergence bounds for the equilibrium gap of OGD in zero-sum games parameterized on the \emph{minimal} first-order variation of the Nash equilibria and the second-order variation of the payoff matrices, subsuming known results for static games. Furthermore, we establish improved \emph{second-order} variation bounds under strong convexity-concavity, as long as each game is repeated multiple times. Our results also apply to time-varying \emph{general-sum} multi-player games via a bilinear formulation of correlated equilibria, which has novel implications for meta-learning and for obtaining refined variation-dependent regret bounds, addressing questions left open in prior papers. Finally, we leverage our framework to also provide new insights on dynamic regret guarantees in static games.  ( 2 min )
    Box$^2$EL: Concept and Role Box Embeddings for the Description Logic EL++. (arXiv:2301.11118v1 [cs.AI])
    Representation learning in the form of semantic embeddings has been successfully applied to a variety of tasks in natural language processing and knowledge graphs. Recently, there has been growing interest in developing similar methods for learning embeddings of entire ontologies. We propose Box$^2$EL, a novel method for representation learning of ontologies in the Description Logic EL++, which represents both concepts and roles as boxes (i.e. axis-aligned hyperrectangles), such that the logical structure of the ontology is preserved. We theoretically prove the soundness of our model and conduct an extensive empirical evaluation, in which we achieve state-of-the-art results in subsumption prediction, link prediction, and deductive reasoning. As part of our evaluation, we introduce a novel benchmark for evaluating EL++ embedding models on predicting subsumptions involving both atomic and complex concepts.  ( 2 min )
    Finding Regions of Counterfactual Explanations via Robust Optimization. (arXiv:2301.11113v1 [cs.LG])
    Counterfactual explanations play an important role in detecting bias and improving the explainability of data-driven classification models. A counterfactual explanation (CE) is a minimal perturbed data point for which the decision of the model changes. Most of the existing methods can only provide one CE, which may not be achievable for the user. In this work we derive an iterative method to calculate robust CEs, i.e. CEs that remain valid even after the features are slightly perturbed. To this end, our method provides a whole region of CEs allowing the user to choose a suitable recourse to obtain a desired outcome. We use algorithmic ideas from robust optimization and prove convergence results for the most common machine learning methods including logistic regression, decision trees, random forests, and neural networks. Our experiments show that our method can efficiently generate globally optimal robust CEs for a variety of common data sets and classification models.  ( 2 min )
    Neural Inverse Operators for Solving PDE Inverse Problems. (arXiv:2301.11167v1 [cs.LG])
    A large class of inverse problems for PDEs are only well-defined as mappings from operators to functions. Existing operator learning frameworks map functions to functions and need to be modified to learn inverse maps from data. We propose a novel architecture termed Neural Inverse Operators (NIOs) to solve these PDE inverse problems. Motivated by the underlying mathematical structure, NIO is based on a suitable composition of DeepONets and FNOs to approximate mappings from operators to functions. A variety of experiments are presented to demonstrate that NIOs significantly outperform baselines and solve PDE inverse problems robustly, accurately and are several orders of magnitude faster than existing direct and PDE-constrained optimization methods.  ( 2 min )
    On the Mathematics of Diffusion Models. (arXiv:2301.11108v1 [cs.LG])
    This paper attempts to present the stochastic differential equations of diffusion models in a manner that is accessible to a broad audience. The diffusion process is defined over a population density in R^d. Of particular interest is a population of images. In a diffusion model one first defines a diffusion process that takes a sample from the population and gradually adds noise until only noise remains. The fundamental idea is to sample from the population by a reverse-diffusion process mapping pure noise to a population sample. The diffusion process is defined independent of any ``interpretation'' but can be analyzed using the mathematics of variational auto-encoders (the ``VAE interpretation'') or the Fokker-Planck equation (the ``score-matching intgerpretation''). Both analyses yield reverse-diffusion methods involving the score function. The Fokker-Planck analysis yields a family of reverse-diffusion SDEs parameterized by any desired level of reverse-diffusion noise including zero (deterministic reverse-diffusion). The VAE analysis yields the reverse-diffusion SDE at the same noise level as the diffusion SDE. The VAE analysis also yields a useful expression for computing the population probabilities of a given point (image). This formula for the probability of a given point does not seem to follow naturally from the Fokker-Planck analysis. Much, but apparently not all, of the mathematics presented here can be found in the literature. Attributions are given at the end of the paper.  ( 2 min )
    Causal Counterfactuals for Improving the Robustness of Reinforcement Learning. (arXiv:2211.05551v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) is applied in a wide variety of fields. RL enables agents to learn tasks autonomously by interacting with the environment. The more critical the tasks are, the higher the demand for the robustness of the RL systems. Causal RL combines RL and causal inference to make RL more robust. Causal RL agents use a causal representation to capture the invariant causal mechanisms that can be transferred from one task to another. Currently, there is limited research in Causal RL, and existing solutions are usually not complete or feasible for real-world applications. In this work, we propose CausalCF, the first complete Causal RL solution incorporating ideas from Causal Curiosity and CoPhy. Causal Curiosity provides an approach for using interventions, and CoPhy is modified to enable the RL agent to perform counterfactuals. We apply CausalCF to complex robotic tasks and show that it improves the RL agent's robustness using a realistic simulation environment called CausalWorld.  ( 2 min )
    New Approach to Malware Detection Using Optimized Convolutional Neural Network. (arXiv:2301.11161v1 [cs.CR])
    Cyber-crimes have become a multi-billion-dollar industry in the recent years. Most cybercrimes/attacks involve deploying some type of malware. Malware that viciously targets every industry, every sector, every enterprise and even individuals has shown its capabilities to take entire business organizations offline and cause significant financial damage in billions of dollars annually. Malware authors are constantly evolving in their attack strategies and sophistication and are developing malware that is difficult to detect and can lay dormant in the background for quite some time in order to evade security controls. Given the above argument, Traditional approaches to malware detection are no longer effective. As a result, deep learning models have become an emerging trend to detect and classify malware. This paper proposes a new convolutional deep learning neural network to accurately and effectively detect malware with high precision. This paper is different than most other papers in the literature in that it uses an expert data science approach by developing a convolutional neural network from scratch to establish a baseline of the performance model first, explores and implements an improvement model from the baseline model, and finally it evaluates the performance of the final model. The baseline model initially achieves 98% accurate rate but after increasing the depth of the CNN model, its accuracy reaches 99.183 which outperforms most of the CNN models in the literature. Finally, to further solidify the effectiveness of this CNN model, we use the improved model to make predictions on new malware samples within our dataset.  ( 2 min )
    Planning Automated Driving with Accident Experience Referencing and Common-sense Inferencing. (arXiv:2301.10892v1 [cs.RO])
    Although a typical autopilot system far surpasses humans in term of sensing accuracy, performance stability and response agility, such a system is still far behind humans in the wisdom of understanding an unfamiliar environment with creativity, adaptivity and resiliency. Current AD brains are basically expert systems featuring logical computations, which resemble the thinking flow of a left brain working at tactical level. A right brain is needed to upgrade the safety of automated driving vehicle onto next generation by making intuitive strategical judgements that can supervise the tactical action planning. In this work, we present the concept of an Automated Driving Strategical Brain (ADSB): a framework of a scene perception and scene safety evaluation system that works at a higher abstraction level, incorporating experience referencing, common-sense inferring and goal-and-value judging capabilities, to provide a contextual perspective for decision making within automated driving planning. The ADSB brain architecture is made up of the Experience Referencing Engine (ERE), the Common-sense Referencing Engine (CIE) and the Goal and Value Keeper (GVK). 1,614,748 cases from FARS/CRSS database of NHTSA in the period 1975 to 2018 are used for the training of ERE model. The kernel of CIE is a trained model, COMET-BART by ATOMIC, which can be used to provide directional advice when tactical-level environmental perception conclusions are ambiguous; it can also use future scenario models to remind tactical-level decision systems to plan ahead of a perceived hazard scene. GVK can take in any additional expert-hand-written rules that are of qualitative nature. Moreover, we believe that with good scalability, the ADSB approach provides a potential solution to the problem of long-tail corner cases encountered in the validation of a rule-based planning algorithm.  ( 2 min )
    Proximal Causal Learning of Heterogeneous Treatment Effects. (arXiv:2301.10913v1 [stat.ML])
    Efficiently and flexibly estimating treatment effect heterogeneity is an important task in a wide variety of settings ranging from medicine to marketing, and there are a considerable number of promising conditional average treatment effect estimators currently available. These, however, typically rely on the assumption that the measured covariates are enough to justify conditional exchangeability. We propose the P-learner, motivated by the R-learner, a tailored two-stage loss function for learning heterogeneous treatment effects in settings where exchangeability given observed covariates is an implausible assumption, and we wish to rely on proxy variables for causal inference. Our proposed estimator can be implemented by off-the-shelf loss-minimizing machine learning methods, which in the case of kernel regression satisfies an oracle bound on the estimated error as long as the nuisance components are estimated reasonably well.  ( 2 min )
    Efficient Hyperdimensional Computing. (arXiv:2301.10902v1 [cs.LG])
    Hyperdimensional computing (HDC) uses binary vectors of high dimensions to perform classification. Due to its simplicity and massive parallelism, HDC can be highly energy-efficient and well-suited for resource-constrained platforms. However, in trading off orthogonality with efficiency, hypervectors may use tens of thousands of dimensions. In this paper, we will examine the necessity for such high dimensions. In particular, we give a detailed theoretical analysis of the relationship among dimensions of hypervectors, accuracy, and orthogonality. The main conclusion of this study is that a much lower dimension, typically less than 100, can also achieve similar or even higher detecting accuracy compared with other state-of-the-art HDC models. Based on this insight, we propose a suite of novel techniques to build HDC models that use binary hypervectors of dimensions that are orders of magnitude smaller than those found in the state-of-the-art HDC models, yet yield equivalent or even improved accuracy and efficiency. For image classification, we achieved an HDC accuracy of 96.88\% with a dimension of only 32 on the MNIST dataset. We further explore our methods on more complex datasets like CIFAR-10 and show the limits of HDC computing.  ( 2 min )
    Incorporating Prior Knowledge into Neural Networks through an Implicit Composite Kernel. (arXiv:2205.07384v5 [cs.LG] UPDATED)
    It is challenging to guide neural network (NN) learning with prior knowledge. In contrast, many known properties, such as spatial smoothness or seasonality, are straightforward to model by choosing an appropriate kernel in a Gaussian process (GP). Many deep learning applications could be enhanced by modeling such known properties. For example, convolutional neural networks (CNNs) are frequently used in remote sensing, which is subject to strong seasonal effects. We propose to blend the strengths of deep learning and the clear modeling capabilities of GPs by using a composite kernel that combines a kernel implicitly defined by a neural network with a second kernel function chosen to model known properties (e.g., seasonality). We implement this idea by combining a deep network and an efficient mapping based on the Nystrom approximation, which we call Implicit Composite Kernel (ICK). We then adopt a sample-then-optimize approach to approximate the full GP posterior distribution. We demonstrate that ICK has superior performance and flexibility on both synthetic and real-world data sets. We believe that ICK framework can be used to include prior information into neural networks in many applications.  ( 2 min )
    Re-embedding data to strengthen recovery guarantees of clustering. (arXiv:2301.10901v1 [cs.LG])
    We propose a clustering method that involves chaining four known techniques into a pipeline yielding an algorithm with stronger recovery guarantees than any of the four components separately. Given $n$ points in $\mathbb R^d$, the first component of our pipeline, which we call leapfrog distances, is reminiscent of density-based clustering, yielding an $n\times n$ distance matrix. The leapfrog distances are then translated to new embeddings using multidimensional scaling and spectral methods, two other known techniques, yielding new embeddings of the $n$ points in $\mathbb R^{d'}$, where $d'$ satisfies $d'\ll d$ in general. Finally, sum-of-norms (SON) clustering is applied to the re-embedded points. Although the fourth step (SON clustering) can in principle be replaced by any other clustering method, our focus is on provable guarantees of recovery of underlying structure. Therefore, we establish that the re-embedding improves recovery SON clustering, since SON clustering is a well-studied method that already has provable guarantees.  ( 2 min )
    Improving Graph Generation by Restricting Graph Bandwidth. (arXiv:2301.10857v1 [cs.LG])
    Deep graph generative modeling has proven capable of learning the distribution of complex, multi-scale structures characterizing real-world graphs. However, one of the main limitations of existing methods is their large output space, which limits generation scalability and hinders accurate modeling of the underlying distribution. To overcome these limitations, we propose a novel approach that significantly reduces the output space of existing graph generative models. Specifically, starting from the observation that many real-world graphs have low graph bandwidth, we restrict graph bandwidth during training and generation. Our strategy improves both generation scalability and quality without increasing architectural complexity or reducing expressiveness. Our approach is compatible with existing graph generative methods, and we describe its application to both autoregressive and one-shot models. We extensively validate our strategy on synthetic and real datasets, including molecular graphs. Our experiments show that, in addition to improving generation efficiency, our approach consistently improves generation quality and reconstruction accuracy. The implementation is made available.  ( 2 min )
    A Practical Influence Approximation for Privacy-Preserving Data Filtering in Federated Learning. (arXiv:2205.11518v2 [cs.CR] UPDATED)
    Federated Learning by nature is susceptible to low-quality, corrupted, or even malicious data that can severely degrade the quality of the learned model. Traditional techniques for data valuation cannot be applied as the data is never revealed. We present a novel technique for filtering, and scoring data based on a practical influence approximation (`lazy' influence) that can be implemented in a privacy-preserving manner. Each participant uses his own data to evaluate the influence of another participant's batch, and reports to the center an obfuscated score using differential privacy. Our technique allows for highly effective filtering of corrupted data in a variety of applications. Importantly, we show that most of the corrupted data can be filtered out (recall of $>90\%$, and even up to $100\%$), even under really strong privacy guarantees ($\varepsilon \leq 1$).  ( 2 min )
    Qualitative Analysis of a Graph Transformer Approach to Addressing Hate Speech: Adapting to Dynamically Changing Content. (arXiv:2301.10871v1 [cs.LG])
    Our work advances an approach for predicting hate speech in social media, drawing out the critical need to consider the discussions that follow a post to successfully detect when hateful discourse may arise. Using graph transformer networks, coupled with modelling attention and BERT-level natural language processing, our approach can capture context and anticipate upcoming anti-social behaviour. In this paper, we offer a detailed qualitative analysis of this solution for hate speech detection in social networks, leading to insights into where the method has the most impressive outcomes in comparison with competitors and identifying scenarios where there are challenges to achieving ideal performance. Included is an exploration of the kinds of posts that permeate social media today, including the use of hateful images. This suggests avenues for extending our model to be more comprehensive. A key insight is that the focus on reasoning about the concept of context positions us well to be able to support multi-modal analysis of online posts. We conclude with a reflection on how the problem we are addressing relates especially well to the theme of dynamic change, a critical concern for all AI solutions for social impact. We also comment briefly on how mental health well-being can be advanced with our work, through curated content attuned to the extent of hate in posts.  ( 2 min )
    Counterfactual Analysis in Dynamic Latent State Models. (arXiv:2205.13832v2 [cs.LG] UPDATED)
    We provide an optimization-based framework to perform counterfactual analysis in a dynamic model with hidden states. Our framework is grounded in the "abduction, action, and prediction" approach to answer counterfactual queries and handles two key challenges where (1) the states are hidden and (2) the model is dynamic. Recognizing the lack of knowledge on the underlying causal mechanism and the possibility of infinitely many such mechanisms, we optimize over this space and compute upper and lower bounds on the counterfactual quantity of interest. Our work brings together ideas from causality, state-space models, simulation, and optimization, and we apply it on a breast cancer case study. To the best of our knowledge, we are the first to compute lower and upper bounds on a counterfactual query in a dynamic latent-state model.  ( 2 min )
    Assistive Recipe Editing through Critiquing. (arXiv:2205.02454v2 [cs.CL] UPDATED)
    There has recently been growing interest in the automatic generation of cooking recipes that satisfy some form of dietary restrictions, thanks in part to the availability of online recipe data. Prior studies have used pre-trained language models, or relied on small paired recipe data (e.g., a recipe paired with a similar one that satisfies a dietary constraint). However, pre-trained language models generate inconsistent or incoherent recipes, and paired datasets are not available at scale. We address these deficiencies with RecipeCrit, a hierarchical denoising auto-encoder that edits recipes given ingredient-level critiques. The model is trained for recipe completion to learn semantic relationships within recipes. Our work's main innovation is our unsupervised critiquing module that allows users to edit recipes by interacting with the predicted ingredients; the system iteratively rewrites recipes to satisfy users' feedback. Experiments on the Recipe1M recipe dataset show that our model can more effectively edit recipes compared to strong language-modeling baselines, creating recipes that satisfy user constraints and are more correct, serendipitous, coherent, and relevant as measured by human judges.  ( 2 min )
    Scale-Free Adversarial Multi-Armed Bandit with Arbitrary Feedback Delays. (arXiv:2110.13400v3 [cs.LG] UPDATED)
    We consider the Scale-Free Adversarial Multi-Armed Bandit (MAB) problem with unrestricted feedback delays. In contrast to the standard assumption that all losses are $[0,1]$-bounded, in our setting, losses can fall in a general bounded interval $[-L, L]$, unknown to the agent beforehand. Furthermore, the feedback of each arm pull can experience arbitrary delays. We propose a novel approach named Scale-Free Delayed INF (SFD-INF) for this novel setting, which combines a recent "convex combination trick" together with a novel doubling and skipping technique. We then present two instances of SFD-INF, each with carefully designed delay-adapted learning scales. The first one SFD-TINF uses $\frac 12$-Tsallis entropy regularizer and can achieve $\widetilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret when the losses are non-negative, where $K$ is the number of actions, $T$ is the number of steps, and $D$ is the total feedback delay. This bound nearly matches the $\Omega((\sqrt{KT}+\sqrt{D\log K})L)$ lower-bound when regarding $K$ as a constant independent of $T$. The second one, SFD-LBINF, works for general scale-free losses and achieves a small-loss style adaptive regret bound $\widetilde{\mathcal O}(\sqrt{K\mathbb{E}[\tilde{\mathfrak L}_T^2]}+\sqrt{KDL})$, which falls to the $\widetilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret in the worst case and is thus more general than SFD-TINF despite a more complicated analysis and several extra logarithmic dependencies. Moreover, both instances also outperform the existing algorithms for non-delayed (i.e., $D=0$) scale-free adversarial MAB problems, which can be of independent interest.  ( 2 min )
    Stochastic Online Fisher Markets: Static Pricing Limits and Adaptive Enhancements. (arXiv:2205.00825v3 [cs.GT] UPDATED)
    In a Fisher market, agents (users) spend a budget of (artificial) currency to buy goods that maximize their utilities while a central planner sets prices on capacity-constrained goods such that the market clears. However, the efficacy of pricing schemes in achieving an equilibrium outcome in Fisher markets typically relies on complete knowledge of users' budgets and utilities and requires that transactions happen in a static market wherein all users are present simultaneously. As a result, we study an online variant of Fisher markets, wherein budget-constrained users with privately known utility and budget parameters, drawn i.i.d. from a distribution $\mathcal{D}$, enter the market sequentially. In this setting, we develop an algorithm that adjusts prices solely based on observations of user consumption, i.e., revealed preference feedback, and achieves a regret and capacity violation of $O(\sqrt{n})$, where $n$ is the number of users and the good capacities scale as $O(n)$. Here, our regret measure is the optimality gap in the objective of the Eisenberg-Gale program between an online algorithm and an offline oracle with complete information on users' budgets and utilities. To establish the efficacy of our approach, we show that any uniform (static) pricing algorithm, including one that sets expected equilibrium prices with complete knowledge of the distribution $\mathcal{D}$, cannot achieve both a regret and constraint violation of less than $\Omega(\sqrt{n})$. While our revealed preference algorithm requires no knowledge of the distribution $\mathcal{D}$, we show that if $\mathcal{D}$ is known, then an adaptive variant of expected equilibrium pricing achieves $O(\log(n))$ regret and constant capacity violation for discrete distributions. Finally, we present numerical experiments to demonstrate the performance of our revealed preference algorithm relative to several benchmarks.  ( 3 min )
    Revisiting the Adversarial Robustness-Accuracy Tradeoff in Robot Learning. (arXiv:2204.07373v2 [cs.RO] UPDATED)
    Adversarial training (i.e., training on adversarially perturbed input data) is a well-studied method for making neural networks robust to potential adversarial attacks during inference. However, the improved robustness does not come for free but rather is accompanied by a decrease in overall model accuracy and performance. Recent work has shown that, in practical robot learning applications, the effects of adversarial training do not pose a fair trade-off but inflict a net loss when measured in holistic robot performance. This work revisits the robustness-accuracy trade-off in robot learning by systematically analyzing if recent advances in robust training methods and theory in conjunction with adversarial robot learning, are capable of making adversarial training suitable for real-world robot applications. We evaluate three different robot learning tasks ranging from autonomous driving in a high-fidelity environment amenable to sim-to-real deployment to mobile robot navigation and gesture recognition. Our results demonstrate that, while these techniques make incremental improvements on the trade-off on a relative scale, the negative impact on the nominal accuracy caused by adversarial training still outweighs the improved robustness by an order of magnitude. We conclude that although progress is happening, further advances in robust learning methods are necessary before they can benefit robot learning tasks in practice.  ( 2 min )
    Privacy preserving n-party scalar product protocol. (arXiv:2112.09436v4 [cs.CR] UPDATED)
    Privacy-preserving machine learning enables the training of models on decentralized datasets without the need to reveal the data, both on horizontal and vertically partitioned data. However, it relies on specialized techniques and algorithms to perform the necessary computations. The privacy preserving scalar product protocol, which enables the dot product of vectors without revealing them, is one popular example for its versatility. Unfortunately, the solutions currently proposed in the literature focus mainly on two-party scenarios, even though scenarios with a higher number of data parties are becoming more relevant. For example when performing analyses that require counting the number of samples which fulfill certain criteria defined across various sites, such as calculating the information gain at a node in a decision tree. In this paper we propose a generalization of the protocol for an arbitrary number of parties, based on an existing two-party method. Our proposed solution relies on a recursive resolution of smaller scalar products. After describing our proposed method, we discuss potential scalability issues. Finally, we describe the privacy guarantees and identify any concerns, as well as comparing the proposed method to the original solution in this aspect.  ( 2 min )
    A Light-weight Deep Human Activity Recognition Algorithm Using Multi-knowledge Distillation. (arXiv:2107.07331v4 [cs.LG] UPDATED)
    Inertial sensor-based human activity recognition (HAR) is the base of many human-centered mobile applications. Deep learning-based fine-grained HAR models enable accurate classification in various complex application scenarios. Nevertheless, the large storage and computational overhead of the existing fine-grained deep HAR models hinder their widespread deployment on resource-limited platforms. Inspired by the knowledge distillation's reasonable model compression and potential performance improvement capability, we design a multi-level HAR modeling pipeline called Stage-Logits-Memory Distillation (SMLDist) based on the widely-used MobileNet. By paying more attention to the frequency-related features during the distillation process, the SMLDist improves the HAR classification robustness of the students. We also propose an auto-search mechanism in the heterogeneous classifiers to improve classification performance. Extensive simulation results demonstrate that SMLDist outperforms various state-of-the-art HAR frameworks in accuracy and F1 macro score. The practical evaluation of the Jetson Xavier AGX platform shows that the SMLDist model is both energy-efficient and computation-efficient. These experiments validate the reasonable balance between the robustness and efficiency of the proposed model. The comparative experiments of knowledge distillation on six public datasets also demonstrate that the SMLDist outperforms other advanced knowledge distillation methods of students' performance, which verifies the good generalization of the SMLDist on other classification tasks, including but not limited to HAR.  ( 2 min )
    Textual Explanations and Critiques in Recommendation Systems. (arXiv:2205.07268v2 [cs.LG] UPDATED)
    Artificial intelligence and machine learning algorithms have become ubiquitous. Although they offer a wide range of benefits, their adoption in decision-critical fields is limited by their lack of interpretability, particularly with textual data. Moreover, with more data available than ever before, it has become increasingly important to explain automated predictions. Generally, users find it difficult to understand the underlying computational processes and interact with the models, especially when the models fail to generate the outcomes or explanations, or both, correctly. This problem highlights the growing need for users to better understand the models' inner workings and gain control over their actions. This dissertation focuses on two fundamental challenges of addressing this need. The first involves explanation generation: inferring high-quality explanations from text documents in a scalable and data-driven manner. The second challenge consists in making explanations actionable, and we refer to it as critiquing. This dissertation examines two important applications in natural language processing and recommendation tasks. Overall, we demonstrate that interpretability does not come at the cost of reduced performance in two consequential applications. Our framework is applicable to other fields as well. This dissertation presents an effective means of closing the gap between promise and practice in artificial intelligence.  ( 2 min )
    RoFL: Robustness of Secure Federated Learning. (arXiv:2107.03311v4 [cs.CR] UPDATED)
    Even though recent years have seen many attacks exposing severe vulnerabilities in Federated Learning (FL), a holistic understanding of what enables these attacks and how they can be mitigated effectively is still lacking. In this work, we demystify the inner workings of existing (targeted) attacks. We provide new insights into why these attacks are possible and why a definitive solution to FL robustness is challenging. We show that the need for ML algorithms to memorize tail data has significant implications for FL integrity. This phenomenon has largely been studied in the context of privacy; our analysis sheds light on its implications for ML integrity. We show that certain classes of severe attacks can be mitigated effectively by enforcing constraints such as norm bounds on clients' updates. We investigate how to efficiently incorporate these constraints into secure FL protocols in the single-server setting. Based on this, we propose RoFL, a new secure FL system that extends secure aggregation with privacy-preserving input validation. Specifically, RoFL can enforce constraints such as $L_2$ and $L_\infty$ bounds on high-dimensional encrypted model updates.  ( 2 min )
    Telling Stories from Computational Notebooks: AI-Assisted Presentation Slides Creation for Presenting Data Science Work. (arXiv:2203.11085v3 [cs.HC] UPDATED)
    Creating presentation slides is a critical but time-consuming task for data scientists. While researchers have proposed many AI techniques to lift data scientists' burden on data preparation and model selection, few have targeted the presentation creation task. Based on the needs identified from a formative study, this paper presents NB2Slides, an AI system that facilitates users to compose presentations of their data science work. NB2Slides uses deep learning methods as well as example-based prompts to generate slides from computational notebooks, and take users' input (e.g., audience background) to structure the slides. NB2Slides also provides an interactive visualization that links the slides with the notebook to help users further edit the slides. A follow-up user evaluation with 12 data scientists shows that participants believed NB2Slides can improve efficiency and reduces the complexity of creating slides. Yet, participants questioned the future of full automation and suggested a human-AI collaboration paradigm.  ( 2 min )
    Constant matters: Fine-grained Complexity of Differentially Private Continual Observation. (arXiv:2202.11205v5 [cs.DS] UPDATED)
    We study fine-grained error bounds for differentially private algorithms for counting under continual observation. Our main insight is that the matrix mechanism when using lower-triangular matrices can be used in the continual observation model. More specifically, we give an explicit factorization for the counting matrix $M_\mathsf{count}$ and upper bound the error explicitly. We also give a fine-grained analysis, specifying the exact constant in the upper bound. Our analysis is based on upper and lower bounds of the {\em completely bounded norm} (cb-norm) of $M_\mathsf{count}$. Along the way, we improve the best-known bound of 28 years by Mathias (SIAM Journal on Matrix Analysis and Applications, 1993) on the cb-norm of $M_\mathsf{count}$ for a large range of the dimension of $M_\mathsf{count}$. Furthermore, we are the first to give concrete error bounds for various problems under continual observation such as binary counting, maintaining a histogram, releasing an approximately cut-preserving synthetic graph, many graph-based statistics, and substring and episode counting. Finally, we note that our result can be used to get a fine-grained error bound for non-interactive local learning {and the first lower bounds on the additive error for $(\epsilon,\delta)$-differentially-private counting under continual observation.} Subsequent to this work, Henzinger et al. (SODA2023) showed that our factorization also achieves fine-grained mean-squared error.  ( 2 min )
    Banker Online Mirror Descent. (arXiv:2106.08943v2 [cs.LG] UPDATED)
    We propose Banker-OMD, a novel framework generalizing the classical Online Mirror Descent (OMD) technique in online learning algorithm design. Banker-OMD allows algorithms to robustly handle delayed feedback, and offers a general methodology for achieving $\tilde{O}(\sqrt{T} + \sqrt{D})$-style regret bounds in various delayed-feedback online learning tasks, where $T$ is the time horizon length and $D$ is the total feedback delay. We demonstrate the power of Banker-OMD with applications to three important bandit scenarios with delayed feedback, including delayed adversarial Multi-armed bandits (MAB), delayed adversarial linear bandits, and a novel delayed best-of-both-worlds MAB setting. Banker-OMD achieves nearly-optimal performance in all the three settings. In particular, it leads to the first delayed adversarial linear bandit algorithm achieving $\tilde{O}(\text{poly}(n)(\sqrt{T} + \sqrt{D}))$ regret.  ( 2 min )
    Learning to Act Safely with Limited Exposure and Almost Sure Certainty. (arXiv:2105.08748v3 [eess.SY] UPDATED)
    This paper puts forward the concept that learning to take safe actions in unknown environments, even with probability one guarantees, can be achieved without the need for an unbounded number of exploratory trials. This is indeed possible, provided that one is willing to navigate trade-offs between optimality, level of exposure to unsafe events, and the maximum detection time of unsafe actions. We illustrate this concept in two complementary settings. We first focus on the canonical multi-armed bandit problem and study the intrinsic trade-offs of learning safety in the presence of uncertainty. Under mild assumptions on sufficient exploration, we provide an algorithm that provably detects all unsafe machines in an (expected) finite number of rounds. The analysis also unveils a trade-off between the number of rounds needed to secure the environment and the probability of discarding safe machines. We then consider the problem of finding optimal policies for a Markov Decision Process (MDP) with almost sure constraints. We show that the action-value function satisfies a barrier-based decomposition which allows for the identification of feasible policies independently of the reward process. Using this decomposition, we develop a Barrier-learning algorithm, that identifies such unsafe state-action pairs in a finite expected number of steps. Our analysis further highlights a trade-off between the time lag for the underlying MDP necessary to detect unsafe actions, and the level of exposure to unsafe events. Simulations corroborate our theoretical findings, further illustrating the aforementioned trade-offs, and suggesting that safety constraints can speed up the learning process.  ( 2 min )
    Joint Training of Deep Ensembles Fails Due to Learner Collusion. (arXiv:2301.11323v1 [cs.LG])
    Ensembles of machine learning models have been well established as a powerful method of improving performance over a single model. Traditionally, ensembling algorithms train their base learners independently or sequentially with the goal of optimizing their joint performance. In the case of deep ensembles of neural networks, we are provided with the opportunity to directly optimize the true objective: the joint performance of the ensemble as a whole. Surprisingly, however, directly minimizing the loss of the ensemble appears to rarely be applied in practice. Instead, most previous research trains individual models independently with ensembling performed post hoc. In this work, we show that this is for good reason - joint optimization of ensemble loss results in degenerate behavior. We approach this problem by decomposing the ensemble objective into the strength of the base learners and the diversity between them. We discover that joint optimization results in a phenomenon in which base learners collude to artificially inflate their apparent diversity. This pseudo-diversity fails to generalize beyond the training data, causing a larger generalization gap. We proceed to demonstrate the practical implications of this effect finding that, in some cases, a balance between independent training and joint optimization can improve performance over the former while avoiding the degeneracies of the latter.  ( 2 min )
    A Context-based Multi-task Hierarchical Inverse Reinforcement Learning Algorithm. (arXiv:2210.01969v2 [cs.LG] UPDATED)
    Multi-task Imitation Learning (MIL) aims to train a policy capable of performing a distribution of tasks, which is essential for general-purpose robots, based on multi-task expert demonstrations. Existing MIL algorithms suffer from low data efficiency and poor performance on complex long-horizontal tasks. We develop Multi-task Hierarchical Adversarial Inverse Reinforcement Learning (MH-AIRL) to learn hierarchically-structured multi-task policies, which is more beneficial for compositional tasks with long horizons and has higher expert data efficiency through identifying and transferring reusable basic skills across tasks. To realize this, MH-AIRL effectively synthesizes context-based multi-task learning, AIRL (an IL approach), and hierarchical policy learning. Further, MH-AIRL can be adopted to demonstrations without the task or skill annotations (i.e., state-action pairs only) which are more accessible in practice. Theoretical justifications are provided for each module of MH-AIRL, and evaluations on challenging multi-task settings demonstrate superior performance and transferability of the multi-task policies learned with MH-AIRL as compared to SOTA MIL baselines.  ( 2 min )
    Your diffusion model secretly knows the dimension of the data manifold. (arXiv:2212.12611v2 [cs.LG] UPDATED)
    In this work, we propose a novel framework for estimating the dimension of the data manifold using a trained diffusion model. A diffusion model approximates the score function i.e. the gradient of the log density of a noise-corrupted version of the target distribution for varying levels of corruption. If the data concentrates around a manifold embedded in the high-dimensional ambient space, then as the level of corruption decreases, the score function points towards the manifold, as this direction becomes the direction of maximal likelihood increase. Therefore, for small levels of corruption, the diffusion model provides us with access to an approximation of the normal bundle of the data manifold. This allows us to estimate the dimension of the tangent space, thus, the intrinsic dimension of the data manifold. To the best of our knowledge, our method is the first deep-learning based estimator of the data manifold dimension and it outperforms well established statistical estimators in controlled experiments on both Euclidean and image data.  ( 2 min )
    Recursive deep learning framework for forecasting the decadal world economic outlook. (arXiv:2301.10874v1 [cs.LG])
    Gross domestic product (GDP) is the most widely used indicator in macroeconomics and the main tool for measuring a country's economic ouput. Due to the diversity and complexity of the world economy, a wide range of models have been used, but there are challenges in making decadal GDP forecasts given unexpected changes such as pandemics and wars. Deep learning models are well suited for modeling temporal sequences have been applied for time series forecasting. In this paper, we develop a deep learning framework to forecast the GDP growth rate of the world economy over a decade. We use Penn World Table as the source of our data, taking data from 1980 to 2019, across 13 countries, such as Australia, China, India, the United States and so on. We test multiple deep learning models, LSTM, BD-LSTM, ED-LSTM and CNN, and compared their results with the traditional time series model (ARIMA,VAR). Our results indicate that ED-LSTM is the best performing model. We present a recursive deep learning framework to predict the GDP growth rate in the next ten years. We predict that most countries will experience economic growth slowdown, stagnation or even recession within five years; only China, France and India are predicted to experience stable, or increasing, GDP growth.  ( 2 min )
    Robust Vocal Quality Feature Embeddings for Dysphonic Voice Detection. (arXiv:2211.09858v2 [cs.SD] UPDATED)
    Approximately 1.2% of the world's population has impaired voice production. As a result, automatic dysphonic voice detection has attracted considerable academic and clinical interest. However, existing methods for automated voice assessment often fail to generalize outside the training conditions or to other related applications. In this paper, we propose a deep learning framework for generating acoustic feature embeddings sensitive to vocal quality and robust across different corpora. A contrastive loss is combined with a classification loss to train our deep learning model jointly. Data warping methods are used on input voice samples to improve the robustness of our method. Empirical results demonstrate that our method not only achieves high in-corpus and cross-corpus classification accuracy but also generates good embeddings sensitive to voice quality and robust across different corpora. We also compare our results against three baseline methods on clean and three variations of deteriorated in-corpus and cross-corpus datasets and demonstrate that the proposed model consistently outperforms the baseline methods.  ( 2 min )
    Reliable Decision from Multiple Subtasks through Threshold Optimization: Content Moderation in the Wild. (arXiv:2208.07522v5 [cs.LG] UPDATED)
    Social media platforms struggle to protect users from harmful content through content moderation. These platforms have recently leveraged machine learning models to cope with the vast amount of user-generated content daily. Since moderation policies vary depending on countries and types of products, it is common to train and deploy the models per policy. However, this approach is highly inefficient, especially when the policies change, requiring dataset re-labeling and model re-training on the shifted data distribution. To alleviate this cost inefficiency, social media platforms often employ third-party content moderation services that provide prediction scores of multiple subtasks, such as predicting the existence of underage personnel, rude gestures, or weapons, instead of directly providing final moderation decisions. However, making a reliable automated moderation decision from the prediction scores of the multiple subtasks for a specific target policy has not been widely explored yet. In this study, we formulate real-world scenarios of content moderation and introduce a simple yet effective threshold optimization method that searches the optimal thresholds of the multiple subtasks to make a reliable moderation decision in a cost-effective way. Extensive experiments demonstrate that our approach shows better performance in content moderation compared to existing threshold optimization methods and heuristics.  ( 2 min )
    Iterative Teaching by Label Synthesis. (arXiv:2110.14432v5 [cs.LG] UPDATED)
    In this paper, we consider the problem of iterative machine teaching, where a teacher provides examples sequentially based on the current iterative learner. In contrast to previous methods that have to scan over the entire pool and select teaching examples from it in each iteration, we propose a label synthesis teaching framework where the teacher randomly selects input teaching examples (e.g., images) and then synthesizes suitable outputs (e.g., labels) for them. We show that this framework can avoid costly example selection while still provably achieving exponential teachability. We propose multiple novel teaching algorithms in this framework. Finally, we empirically demonstrate the value of our framework.  ( 2 min )
    VAuLT: Augmenting the Vision-and-Language Transformer for Sentiment Classification on Social Media. (arXiv:2208.09021v3 [cs.CV] UPDATED)
    We propose the Vision-and-Augmented-Language Transformer (VAuLT). VAuLT is an extension of the popular Vision-and-Language Transformer (ViLT), and improves performance on vision-and-language (VL) tasks that involve more complex text inputs than image captions while having minimal impact on training and inference efficiency. ViLT, importantly, enables efficient training and inference in VL tasks, achieved by encoding images using a linear projection of patches instead of an object detector. However, it is pretrained on captioning datasets, where the language input is simple, literal, and descriptive, therefore lacking linguistic diversity. So, when working with multimedia data in the wild, such as multimodal social media data, there is a notable shift from captioning language data, as well as diversity of tasks. We indeed find evidence that the language capacity of ViLT is lacking. The key insight and novelty of VAuLT is to propagate the output representations of a large language model (LM) like BERT to the language input of ViLT. We show that joint training of the LM and ViLT can yield relative improvements up to 20% over ViLT and achieve state-of-the-art or comparable performance on VL tasks involving richer language inputs and affective constructs, such as for Target-Oriented Sentiment Classification in TWITTER-2015 and TWITTER-2017, and Sentiment Classification in MVSA-Single and MVSA-Multiple. Our code is available at https://github.com/gchochla/VAuLT.  ( 2 min )
    KSD Aggregated Goodness-of-fit Test. (arXiv:2202.00824v5 [stat.ML] UPDATED)
    We investigate properties of goodness-of-fit tests based on the Kernel Stein Discrepancy (KSD). We introduce a strategy to construct a test, called KSDAgg, which aggregates multiple tests with different kernels. KSDAgg avoids splitting the data to perform kernel selection (which leads to a loss in test power), and rather maximises the test power over a collection of kernels. We provide non-asymptotic guarantees on the power of KSDAgg: we show it achieves the smallest uniform separation rate of the collection, up to a logarithmic term. For compactly supported densities with bounded model score function, we derive the rate for KSDAgg over restricted Sobolev balls; this rate corresponds to the minimax optimal rate over unrestricted Sobolev balls, up to an iterated logarithmic term. KSDAgg can be computed exactly in practice as it relies either on a parametric bootstrap or on a wild bootstrap to estimate the quantiles and the level corrections. In particular, for the crucial choice of bandwidth of a fixed kernel, it avoids resorting to arbitrary heuristics (such as median or standard deviation) or to data splitting. We find on both synthetic and real-world data that KSDAgg outperforms other state-of-the-art quadratic-time adaptive KSD-based goodness-of-fit testing procedures.  ( 2 min )
    Characterizing the Influence of Graph Elements. (arXiv:2210.07441v2 [cs.LG] UPDATED)
    Influence function, a method from robust statistics, measures the changes of model parameters or some functions about model parameters concerning the removal or modification of training instances. It is an efficient and useful post-hoc method for studying the interpretability of machine learning models without the need for expensive model re-training. Recently, graph convolution networks (GCNs), which operate on graph data, have attracted a great deal of attention. However, there is no preceding research on the influence functions of GCNs to shed light on the effects of removing training nodes/edges from an input graph. Since the nodes/edges in a graph are interdependent in GCNs, it is challenging to derive influence functions for GCNs. To fill this gap, we started with the simple graph convolution (SGC) model that operates on an attributed graph and formulated an influence function to approximate the changes in model parameters when a node or an edge is removed from an attributed graph. Moreover, we theoretically analyzed the error bound of the estimated influence of removing an edge. We experimentally validated the accuracy and effectiveness of our influence estimation function. In addition, we showed that the influence function of an SGC model could be used to estimate the impact of removing training nodes/edges on the test performance of the SGC without re-training the model. Finally, we demonstrated how to use influence functions to guide the adversarial attacks on GCNs effectively.  ( 2 min )
    Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task. (arXiv:2210.13382v3 [cs.LG] UPDATED)
    Language models show a surprising range of capabilities, but the source of their apparent competence is unclear. Do these networks just memorize a collection of surface statistics, or do they rely on internal representations of the process that generates the sequences they see? We investigate this question by applying a variant of the GPT model to the task of predicting legal moves in a simple board game, Othello. Although the network has no a priori knowledge of the game or its rules, we uncover evidence of an emergent nonlinear internal representation of the board state. Interventional experiments indicate this representation can be used to control the output of the network and create "latent saliency maps" that can help explain predictions in human terms.  ( 2 min )
    Review of Natural Language Processing in Pharmacology. (arXiv:2208.10228v2 [cs.CL] UPDATED)
    Natural language processing (NLP) is an area of artificial intelligence that applies information technologies to process the human language, understand it to a certain degree, and use it in various applications. This area has rapidly developed in the last few years and now employs modern variants of deep neural networks to extract relevant patterns from large text corpora. The main objective of this work is to survey the recent use of NLP in the field of pharmacology. As our work shows, NLP is a highly relevant information extraction and processing approach for pharmacology. It has been used extensively, from intelligent searches through thousands of medical documents to finding traces of adversarial drug interactions in social media. We split our coverage into five categories to survey modern NLP methodology, commonly addressed tasks, relevant textual data, knowledge bases, and useful programming libraries. We split each of the five categories into appropriate subcategories, describe their main properties and ideas, and summarize them in a tabular form. The resulting survey presents a comprehensive overview of the area, useful to practitioners and interested observers.  ( 2 min )
    Cut and Learn for Unsupervised Object Detection and Instance Segmentation. (arXiv:2301.11320v1 [cs.CV])
    We propose Cut-and-LEaRn (CutLER), a simple approach for training unsupervised object detection and segmentation models. We leverage the property of self-supervised models to 'discover' objects without supervision and amplify it to train a state-of-the-art localization model without any human labels. CutLER first uses our proposed MaskCut approach to generate coarse masks for multiple objects in an image and then learns a detector on these masks using our robust loss function. We further improve the performance by self-training the model on its predictions. Compared to prior work, CutLER is simpler, compatible with different detection architectures, and detects multiple objects. CutLER is also a zero-shot unsupervised detector and improves detection performance AP50 by over 2.7 times on 11 benchmarks across domains like video frames, paintings, sketches, etc. With finetuning, CutLER serves as a low-shot detector surpassing MoCo-v2 by 7.3% APbox and 6.6% APmask on COCO when training with 5% labels.  ( 2 min )
    Smoothed Online Learning for Prediction in Piecewise Affine Systems. (arXiv:2301.11187v1 [stat.ML])
    The problem of piecewise affine (PWA) regression and planning is of foundational importance to the study of online learning, control, and robotics, where it provides a theoretically and empirically tractable setting to study systems undergoing sharp changes in the dynamics. Unfortunately, due to the discontinuities that arise when crossing into different ``pieces,'' learning in general sequential settings is impossible and practical algorithms are forced to resort to heuristic approaches. This paper builds on the recently developed smoothed online learning framework and provides the first algorithms for prediction and simulation in PWA systems whose regret is polynomial in all relevant problem parameters under a weak smoothness assumption; moreover, our algorithms are efficient in the number of calls to an optimization oracle. We further apply our results to the problems of one-step prediction and multi-step simulation regret in piecewise affine dynamical systems, where the learner is tasked with simulating trajectories and regret is measured in terms of the Wasserstein distance between simulated and true data. Along the way, we develop several technical tools of more general interest.  ( 2 min )
    Uncertain Evidence in Probabilistic Models and Stochastic Simulators. (arXiv:2210.12236v2 [stat.ML] UPDATED)
    We consider the problem of performing Bayesian inference in probabilistic models where observations are accompanied by uncertainty, referred to as "uncertain evidence." We explore how to interpret uncertain evidence, and by extension the importance of proper interpretation as it pertains to inference about latent variables. We consider a recently-proposed method "distributional evidence" as well as revisit two older methods: Jeffrey's rule and virtual evidence. We devise guidelines on how to account for uncertain evidence and we provide new insights, particularly regarding consistency. To showcase the impact of different interpretations of the same uncertain evidence, we carry out experiments in which one interpretation is defined as "correct." We then compare inference results from each different interpretation illustrating the importance of careful consideration of uncertain evidence.  ( 2 min )
    Certified Interpretability Robustness for Class Activation Mapping. (arXiv:2301.11324v1 [cs.LG])
    Interpreting machine learning models is challenging but crucial for ensuring the safety of deep networks in autonomous driving systems. Due to the prevalence of deep learning based perception models in autonomous vehicles, accurately interpreting their predictions is crucial. While a variety of such methods have been proposed, most are shown to lack robustness. Yet, little has been done to provide certificates for interpretability robustness. Taking a step in this direction, we present CORGI, short for Certifiably prOvable Robustness Guarantees for Interpretability mapping. CORGI is an algorithm that takes in an input image and gives a certifiable lower bound for the robustness of the top k pixels of its CAM interpretability map. We show the effectiveness of CORGI via a case study on traffic sign data, certifying lower bounds on the minimum adversarial perturbation not far from (4-5x) state-of-the-art attack methods.  ( 2 min )
    Predictive Crypto-Asset Automated Market Making Architecture for Decentralized Finance using Deep Reinforcement Learning. (arXiv:2211.01346v2 [q-fin.TR] UPDATED)
    The study proposes a quote-driven predictive automated market maker (AMM) platform with on-chain custody and settlement functions, alongside off-chain predictive reinforcement learning capabilities to improve liquidity provision of real-world AMMs. The proposed AMM architecture is an augmentation to the Uniswap V3, a cryptocurrency AMM protocol, by utilizing a novel market equilibrium pricing for reduced divergence and slippage loss. Further, the proposed architecture involves a predictive AMM capability, utilizing a deep hybrid Long Short-Term Memory (LSTM) and Q-learning reinforcement learning framework that looks to improve market efficiency through better forecasts of liquidity concentration ranges, so liquidity starts moving to expected concentration ranges, prior to asset price movement, so that liquidity utilization is improved. The augmented protocol framework is expected have practical real-world implications, by (i) reducing divergence loss for liquidity providers, (ii) reducing slippage for crypto-asset traders, while (iii) improving capital efficiency for liquidity provision for the AMM protocol. To our best knowledge, there are no known protocol or literature that are proposing similar deep learning-augmented AMM that achieves similar capital efficiency and loss minimization objectives for practical real-world applications.  ( 2 min )
    What you need to know to train recurrent neural networks to make Flip Flops memories and more. (arXiv:2010.07858v3 [cs.LG] UPDATED)
    Training neural networks to perform different tasks is relevant across various disciplines beyond Machine Learning. In particular, Recurrent Neural Networks (RNNs) are of great interest to different scientific communities. Open-source frameworks dedicated to Machine Learning, such as Tensorflow [1] and Keras [2] have produced significant changes in the development of technologies that we currently use. One relevant problem that can be approached with them is how to build the models to study dynamical systems and the brain. Specifically, how to extract the relevant information to answer the scientific questions of interest. The purpose of the present work is to contribute to this aim by analyzing a temporal processing task, in this case, a 3-bit Flip Flop memory. The modelling procedure in every step is shown: from equations to the software development. The networks obtained were analyzed to describe the dynamics and to show different visualization and analysis tools. The code developed in this premier is also provided to be used for modelling other tasks or systems.  ( 2 min )
    Neural Continuous-Discrete State Space Models for Irregularly-Sampled Time Series. (arXiv:2301.11308v1 [cs.LG])
    Learning accurate predictive models of real-world dynamic phenomena (e.g., climate, biological) remains a challenging task. One key issue is that the data generated by both natural and artificial processes often comprise time series that are irregularly sampled and/or contain missing observations. In this work, we propose the Neural Continuous-Discrete State Space Model (NCDSSM) for continuous-time modeling of time series through discrete-time observations. NCDSSM employs auxiliary variables to disentangle recognition from dynamics, thus requiring amortized inference only for the auxiliary variables. Leveraging techniques from continuous-discrete filtering theory, we demonstrate how to perform accurate Bayesian inference for the dynamic states. We propose three flexible parameterizations of the latent dynamics and an efficient training objective that marginalizes the dynamic states during inference. Empirical results on multiple benchmark datasets across various domains show improved imputation and forecasting performance of NCDSSM over existing models.  ( 2 min )
    Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning. (arXiv:2301.11321v1 [cs.LG])
    Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, but counteracting off-policy bias without exacerbating variance is challenging. Classically, off-policy bias is corrected in a per-decision manner: past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio after each action via eligibility traces. Many off-policy algorithms rely on this mechanism, along with differing protocols for cutting the IS ratios to combat the variance of the IS estimator. Unfortunately, once a trace has been fully cut, the effect cannot be reversed. This has led to the development of credit-assignment strategies that account for multiple past experiences at a time. These trajectory-aware methods have not been extensively analyzed, and their theoretical justification remains uncertain. In this paper, we propose a multistep operator that can express both per-decision and trajectory-aware methods. We prove convergence conditions for our operator in the tabular setting, establishing the first guarantees for several existing methods as well as many new ones. Finally, we introduce Recency-Bounded Importance Sampling (RBIS), which leverages trajectory awareness to perform robustly across $\lambda$-values in an off-policy control task.  ( 2 min )
    Anatomy-aware and acquisition-agnostic joint registration with SynthMorph. (arXiv:2301.11329v1 [eess.IV])
    Affine image registration is a cornerstone of medical-image processing and analysis. While classical algorithms can achieve excellent accuracy, they solve a time-consuming optimization for every new image pair. Deep-learning (DL) methods learn a function that maps an image pair to an output transform. Evaluating the functions is fast, but capturing large transforms can be challenging, and networks tend to struggle if a test-image characteristic shifts from the training domain, such as the contrast or resolution. A majority of affine methods are also agnostic to the anatomy the user wishes to align; the registration will be inaccurate if algorithms consider all structures in the image. We address these shortcomings with a fast, robust, and easy-to-use DL tool for affine and deformable registration of any brain image without preprocessing, right off the MRI scanner. First, we rigorously analyze how competing architectures learn affine transforms across a diverse set of neuroimaging data, aiming to truly capture the behavior of methods in the real world. Second, we leverage a recent strategy to train networks with wildly varying images synthesized from label maps, yielding robust performance across acquisition specifics. Third, we optimize the spatial overlap of select anatomical labels, which enables networks to distinguish between anatomy of interest and irrelevant structures, removing the need for preprocessing that excludes content that would otherwise reduce the accuracy of anatomy-specific registration. We combine the affine model with prior work on deformable registration and test brain-specific registration across a landscape of MRI protocols unseen at training, demonstrating consistent and improved accuracy compared to existing tools. We distribute our code and tool at https://w3id.org/synthmorph, providing a single complete end-to-end solution for registration of brain MRI.  ( 3 min )
    Coin Sampling: Gradient-Based Bayesian Inference without Learning Rates. (arXiv:2301.11294v1 [stat.ML])
    In recent years, particle-based variational inference (ParVI) methods such as Stein variational gradient descent (SVGD) have grown in popularity as scalable methods for Bayesian inference. Unfortunately, the properties of such methods invariably depend on hyperparameters such as the learning rate, which must be carefully tuned by the practitioner in order to ensure convergence to the target measure at a suitable rate. In this paper, we introduce a suite of new particle-based methods for scalable Bayesian inference based on coin betting, which are entirely learning-rate free. We illustrate the performance of our approach on a range of numerical examples, including several high-dimensional models and datasets, demonstrating comparable performance to other ParVI algorithms.  ( 2 min )
    ZiCo: Zero-shot NAS via Inverse Coefficient of Variation on Gradients. (arXiv:2301.11300v1 [cs.LG])
    Neural Architecture Search (NAS) is widely used to automatically design the neural network with the best performance among a large number of candidate architectures. To reduce the search time, zero-shot NAS aims at designing training-free proxies that can predict the test performance of a given architecture. However, as shown recently, none of the zero-shot proxies proposed to date can actually work consistently better than a naive proxy, namely, the number of network parameters (#Params). To improve this state of affairs, as the main theoretical contribution, we first reveal how some specific gradient properties across different samples impact the convergence rate and generalization capacity of neural networks. Based on this theoretical analysis, we propose a new zero-shot proxy, ZiCo, the first proxy that works consistently better than #Params. We demonstrate that ZiCo works better than State-Of-The-Art (SOTA) proxies on several popular NAS-Benchmarks (NASBench101, NATSBench-SSS/TSS, TransNASBench-101) for multiple applications (e.g., image classification/reconstruction and pixel-level prediction). Finally, we demonstrate that the optimal architectures found via ZiCo are as competitive as the ones found by one-shot and multi-shot NAS methods, but with much less search time. For example, ZiCo-based NAS can find optimal architectures with 78.1%, 79.4%, and 80.4% test accuracy under inference budgets of 450M, 600M, and 1000M FLOPs on ImageNet within 0.4 GPU days.  ( 2 min )
    Open Problems in Applied Deep Learning. (arXiv:2301.11316v1 [cs.LG])
    This work formulates the machine learning mechanism as a bi-level optimization problem. The inner level optimization loop entails minimizing a properly chosen loss function evaluated on the training data. This is nothing but the well-studied training process in pursuit of optimal model parameters. The outer level optimization loop is less well-studied and involves maximizing a properly chosen performance metric evaluated on the validation data. This is what we call the "iteration process", pursuing optimal model hyper-parameters. Among many other degrees of freedom, this process entails model engineering (e.g., neural network architecture design) and management, experiment tracking, dataset versioning and augmentation. The iteration process could be automated via Automatic Machine Learning (AutoML) or left to the intuitions of machine learning students, engineers, and researchers. Regardless of the route we take, there is a need to reduce the computational cost of the iteration step and as a direct consequence reduce the carbon footprint of developing artificial intelligence algorithms. Despite the clean and unified mathematical formulation of the iteration step as a bi-level optimization problem, its solutions are case specific and complex. This work will consider such cases while increasing the level of complexity from supervised learning to semi-supervised, self-supervised, unsupervised, few-shot, federated, reinforcement, and physics-informed learning. As a consequence of this exercise, this proposal surfaces a plethora of open problems in the field, many of which can be addressed in parallel.  ( 2 min )
    BayesSpeech: A Bayesian Transformer Network for Automatic Speech Recognition. (arXiv:2301.11276v1 [eess.AS])
    Recent developments using End-to-End Deep Learning models have been shown to have near or better performance than state of the art Recurrent Neural Networks (RNNs) on Automatic Speech Recognition tasks. These models tend to be lighter weight and require less training time than traditional RNN-based approaches. However, these models take frequentist approach to weight training. In theory, network weights are drawn from a latent, intractable probability distribution. We introduce BayesSpeech for end-to-end Automatic Speech Recognition. BayesSpeech is a Bayesian Transformer Network where these intractable posteriors are learned through variational inference and the local reparameterization trick without recurrence. We show how the introduction of variance in the weights leads to faster training time and near state-of-the-art performance on LibriSpeech-960.  ( 2 min )
    Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons. (arXiv:2301.11270v1 [cs.LG])
    We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). Our analysis shows that when the true reward function is linear, the widely used maximum likelihood estimator (MLE) converges under both the Bradley-Terry-Luce (BTL) model and the Plackett-Luce (PL) model. However, we show that when training a policy based on the learned reward model, MLE fails while a pessimistic MLE provides policies with improved performance under certain coverage assumptions. Additionally, we demonstrate that under the PL model, the true MLE and an alternative MLE that splits the $K$-wise comparison into pairwise comparisons both converge. Moreover, the true MLE is asymptotically more efficient. Our results validate the empirical success of existing RLHF algorithms in InstructGPT and provide new insights for algorithm design. Furthermore, our results unify the problem of RLHF and Max Entropy Inverse Reinforcement Learning, and provide the first sample complexity bound for both problems.  ( 2 min )
    Graph Encoder Ensemble for Simultaneous Vertex Embedding and Community Detection. (arXiv:2301.11290v1 [cs.SI])
    In this paper we propose a novel and computationally efficient method to simultaneously achieve vertex embedding, community detection, and community size determination. By utilizing a normalized one-hot graph encoder and a new rank-based cluster size measure, the proposed graph encoder ensemble algorithm achieves excellent numerical performance throughout a variety of simulations and real data experiments.  ( 2 min )
    Understanding Finetuning for Factual Knowledge Extraction from Language Models. (arXiv:2301.11293v1 [cs.CL])
    Language models (LMs) pretrained on large corpora of text from the web have been observed to contain large amounts of various types of knowledge about the world. This observation has led to a new and exciting paradigm in knowledge graph construction where, instead of manual curation or text mining, one extracts knowledge from the parameters of an LM. Recently, it has been shown that finetuning LMs on a set of factual knowledge makes them produce better answers to queries from a different set, thus making finetuned LMs a good candidate for knowledge extraction and, consequently, knowledge graph construction. In this paper, we analyze finetuned LMs for factual knowledge extraction. We show that along with its previously known positive effects, finetuning also leads to a (potentially harmful) phenomenon which we call Frequency Shock, where at the test time the model over-predicts rare entities that appear in the training set and under-predicts common entities that do not appear in the training set enough times. We show that Frequency Shock leads to a degradation in the predictions of the model and beyond a point, the harm from Frequency Shock can even outweigh the positive effects of finetuning, making finetuning harmful overall. We then consider two solutions to remedy the identified negative effect: 1- model mixing and 2- mixture finetuning with the LM's pre-training task. The two solutions combined lead to significant improvements compared to vanilla finetuning.  ( 2 min )
    Text-To-4D Dynamic Scene Generation. (arXiv:2301.11280v1 [cs.CV])
    We present MAV3D (Make-A-Video3D), a method for generating three-dimensional dynamic scenes from text descriptions. Our approach uses a 4D dynamic Neural Radiance Field (NeRF), which is optimized for scene appearance, density, and motion consistency by querying a Text-to-Video (T2V) diffusion-based model. The dynamic video output generated from the provided text can be viewed from any camera location and angle, and can be composited into any 3D environment. MAV3D does not require any 3D or 4D data and the T2V model is trained only on Text-Image pairs and unlabeled videos. We demonstrate the effectiveness of our approach using comprehensive quantitative and qualitative experiments and show an improvement over previously established internal baselines. To the best of our knowledge, our method is the first to generate 3D dynamic scenes given a text description.  ( 2 min )
    Classification of vertices on social networks by multiple approaches. (arXiv:2301.11288v1 [cs.SI])
    Due to the advent of the expressions of data other than tabular formats, the topological compositions which make samples interrelated came into prominence. Analogically, those networks can be interpreted as social connections, dataflow maps, citation influence graphs, protein bindings, etc. However, in the case of social networks, it is highly crucial to evaluate the labels of discrete communities. The reason underneath for such a study is the non-negligible importance of analyzing graph networks to partition the vertices by using the topological features of network graphs, solely. For each of these interaction-based entities, a social graph, a mailing dataset, and two citation sets are selected as the testbench repositories. This paper, it was not only assessed the most valuable method but also determined how graph neural networks work and the need to improve against non-neural network approaches which are faster and computationally cost-effective. Also, this paper showed a limit to be excesses by prospective graph neural network variations by using the topological features of networks trialed.  ( 2 min )
    Gaussian process regression and conditional Karhunen-Lo\'{e}ve models for data assimilation in inverse problems. (arXiv:2301.11279v1 [cs.LG])
    We present a model inversion algorithm, CKLEMAP, for data assimilation and parameter estimation in partial differential equation models of physical systems with spatially heterogeneous parameter fields. These fields are approximated using low-dimensional conditional Karhunen-Lo\'{e}ve expansions, which are constructed using Gaussian process regression models of these fields trained on the parameters' measurements. We then assimilate measurements of the state of the system and compute the maximum a posteriori estimate of the CKLE coefficients by solving a nonlinear least-squares problem. When solving this optimization problem, we efficiently compute the Jacobian of the vector objective by exploiting the sparsity structure of the linear system of equations associated with the forward solution of the physics problem. The CKLEMAP method provides better scalability compared to the standard MAP method. In the MAP method, the number of unknowns to be estimated is equal to the number of elements in the numerical forward model. On the other hand, in CKLEMAP, the number of unknowns (CKLE coefficients) is controlled by the smoothness of the parameter field and the number of measurements, and is in general much smaller than the number of discretization nodes, which leads to a significant reduction of computational cost with respect to the standard MAP method. To show its advantage in scalability, we apply CKLEMAP to estimate the transmissivity field in a two-dimensional steady-state subsurface flow model of the Hanford Site by assimilating synthetic measurements of transmissivity and hydraulic head. We find that the execution time of CKLEMAP scales nearly linearly as $N^{1.33}$, where $N$ is the number of discretization nodes, while the execution time of standard MAP scales as $N^{2.91}$. The CKLEMAP method improved execution time without sacrificing accuracy when compared to the standard MAP.  ( 3 min )
    Real-Time Digital Twins: Vision and Research Directions for 6G and Beyond. (arXiv:2301.11283v1 [eess.SP])
    This article presents a vision where \textit{real-time} digital twins of the physical wireless environments are continuously updated using multi-modal sensing data from the distributed infrastructure and user devices, and are used to make communication and sensing decisions. This vision is mainly enabled by the advances in precise 3D maps, multi-modal sensing, ray-tracing computations, and machine/deep learning. This article details this vision, explains the different approaches for constructing and utilizing these real-time digital twins, discusses the applications and open problems, and presents a research platform that can be used to investigate various digital twin research directions.  ( 2 min )
    Online Convex Optimization with Stochastic Constraints: Zero Constraint Violation and Bandit Feedback. (arXiv:2301.11267v1 [math.OC])
    This paper studies online convex optimization with stochastic constraints. We propose a variant of the drift-plus-penalty algorithm that guarantees $O(\sqrt{T})$ expected regret and zero constraint violation, after a fixed number of iterations, which improves the vanilla drift-plus-penalty method with $O(\sqrt{T})$ constraint violation. Our algorithm is oblivious to the length of the time horizon $T$, in contrast to the vanilla drift-plus-penalty method. This is based on our novel drift lemma that provides time-varying bounds on the virtual queue drift and, as a result, leads to time-varying bounds on the expected virtual queue length. Moreover, we extend our framework to stochastic-constrained online convex optimization under two-point bandit feedback. We show that by adapting our algorithmic framework to the bandit feedback setting, we may still achieve $O(\sqrt{T})$ expected regret and zero constraint violation, improving upon the previous work for the case of identical constraint functions. Numerical results demonstrate our theoretical results.  ( 2 min )
    AlignGraph: A Group of Generative Models for Graphs. (arXiv:2301.11273v1 [cs.SI])
    It is challenging for generative models to learn a distribution over graphs because of the lack of permutation invariance: nodes may be ordered arbitrarily across graphs, and standard graph alignment is combinatorial and notoriously expensive. We propose AlignGraph, a group of generative models that combine fast and efficient graph alignment methods with a family of deep generative models that are invariant to node permutations. Our experiments demonstrate that our framework successfully learns graph distributions, outperforming competitors by 25% -560% in relevant performance scores.  ( 2 min )
    Maximum Optimality Margin: A Unified Approach for Contextual Linear Programming and Inverse Linear Programming. (arXiv:2301.11260v1 [cs.LG])
    In this paper, we study the predict-then-optimize problem where the output of a machine learning prediction task is used as the input of some downstream optimization problem, say, the objective coefficient vector of a linear program. The problem is also known as predictive analytics or contextual linear programming. The existing approaches largely suffer from either (i) optimization intractability (a non-convex objective function)/statistical inefficiency (a suboptimal generalization bound) or (ii) requiring strong condition(s) such as no constraint or loss calibration. We develop a new approach to the problem called \textit{maximum optimality margin} which designs the machine learning loss function by the optimality condition of the downstream optimization. The max-margin formulation enjoys both computational efficiency and good theoretical properties for the learning procedure. More importantly, our new approach only needs the observations of the optimal solution in the training data rather than the objective function, which makes it a new and natural approach to the inverse linear programming problem under both contextual and context-free settings; we also analyze the proposed method under both offline and online settings, and demonstrate its performance using numerical experiments.  ( 2 min )
    A Benchmark Study by using various Machine Learning Models for Predicting Covid-19 trends. (arXiv:2301.11257v1 [cs.LG])
    Machine learning and deep learning play vital roles in predicting diseases in the medical field. Machine learning algorithms are widely classified as supervised, unsupervised, and reinforcement learning. This paper contains a detailed description of our experimental research work in that we used a supervised machine-learning algorithm to build our model for outbreaks of the novel Coronavirus that has spread over the whole world and caused many deaths, which is one of the most disastrous Pandemics in the history of the world. The people suffered physically and economically to survive in this lockdown. This work aims to understand better how machine learning, ensemble, and deep learning models work and are implemented in the real dataset. In our work, we are going to analyze the current trend or pattern of the coronavirus and then predict the further future of the covid-19 confirmed cases or new cases by training the past Covid-19 dataset by using the machine learning algorithm such as Linear Regression, Polynomial Regression, K-nearest neighbor, Decision Tree, Support Vector Machine and Random forest algorithm are used to train the model. The decision tree and the Random Forest algorithm perform better than SVR in this work. The performance of SVR and lasso regression are low in all prediction areas Because the SVR is challenging to separate the data using the hyperplane for this type of problem. So SVR mostly gives a lower performance in this problem. Ensemble (Voting, Bagging, and Stacking) and deep learning models(ANN) also predict well. After the prediction, we evaluated the model using MAE, MSE, RMSE, and MAPE. This work aims to find the trend/pattern of the covid-19.  ( 3 min )
    Molecular Language Model as Multi-task Generator. (arXiv:2301.11259v1 [cs.LG])
    Molecule generation with desired properties has grown immensely in popularity by disruptively changing the way scientists design molecular structures and providing support for chemical and materials design. However, despite the promising outcome, previous machine learning-based deep generative models suffer from a reliance on complex, task-specific fine-tuning, limited dimensional latent spaces, or the quality of expert rules. In this work, we propose MolGen, a pre-trained molecular language model that effectively learns and shares knowledge across multiple generation tasks and domains. Specifically, we pre-train MolGen with the chemical language SELFIES on more than 100 million unlabelled molecules. We further propose multi-task molecular prefix tuning across several molecular generation tasks and different molecular domains (synthetic & natural products) with a self-feedback mechanism. Extensive experiments show that MolGen can obtain superior performances on well-known molecular generation benchmark datasets. The further analysis illustrates that MolGen can accurately capture the distribution of molecules, implicitly learn their structural characteristics, and efficiently explore the chemical space with the guidance of multi-task molecular prefix tuning. Codes, datasets, and the pre-trained model will be available in https://github.com/zjunlp/MolGen.  ( 2 min )
    BiBench: Benchmarking and Analyzing Network Binarization. (arXiv:2301.11233v1 [cs.CV])
    Network binarization emerges as one of the most promising compression approaches offering extraordinary computation and memory savings by minimizing the bit-width. However, recent research has shown that applying existing binarization algorithms to diverse tasks, architectures, and hardware in realistic scenarios is still not straightforward. Common challenges of binarization, such as accuracy degradation and efficiency limitation, suggest that its attributes are not fully understood. To close this gap, we present BiBench, a rigorously designed benchmark with in-depth analysis for network binarization. We first carefully scrutinize the requirements of binarization in the actual production and define evaluation tracks and metrics for a comprehensive and fair investigation. Then, we evaluate and analyze a series of milestone binarization algorithms that function at the operator level and with extensive influence. Our benchmark reveals that 1) the binarized operator has a crucial impact on the performance and deployability of binarized networks; 2) the accuracy of binarization varies significantly across different learning tasks and neural architectures; 3) binarization has demonstrated promising efficiency potential on edge devices despite the limited hardware support. The results and analysis also lead to a promising paradigm for accurate and efficient binarization. We believe that BiBench will contribute to the broader adoption of binarization and serve as a foundation for future research.  ( 2 min )
    Semi-Supervised Image Captioning by Adversarially Propagating Labeled Data. (arXiv:2301.11174v1 [cs.CV])
    We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models. Constructing a large-scale labeled image captioning dataset is an expensive task in terms of labor, time, and cost. In contrast to manually annotating all the training samples, separately collecting uni-modal datasets is immensely easier, e.g., a large-scale image dataset and a sentence dataset. We leverage such massive unpaired image and caption data upon standard paired data by learning to associate them. To this end, our proposed semi-supervised learning method assigns pseudo-labels to unpaired samples in an adversarial learning fashion, where the joint distribution of image and caption is learned. Our method trains a captioner to learn from a paired data and to progressively associate unpaired data. This approach shows noticeable performance improvement even in challenging scenarios including out-of-task data (i.e., relational captioning, where the target task is different from the unpaired data) and web-crawled data. We also show that our proposed method is theoretically well-motivated and has a favorable global optimal property. Our extensive and comprehensive empirical results both on (1) image-based and (2) dense region-based captioning datasets followed by comprehensive analysis on the scarcely-paired COCO dataset demonstrate the consistent effectiveness of our semisupervised learning method with unpaired data compared to competing methods.  ( 2 min )
    Causal Graph Discovery from Self and Mutually Exciting Time Series. (arXiv:2301.11197v1 [cs.LG])
    We present a generalized linear structural causal model, coupled with a novel data-adaptive linear regularization, to recover causal directed acyclic graphs (DAGs) from time series. By leveraging a recently developed stochastic monotone Variational Inequality (VI) formulation, we cast the causal discovery problem as a general convex optimization. Furthermore, we develop a non-asymptotic recovery guarantee and quantifiable uncertainty by solving a linear program to establish confidence intervals for a wide range of non-linear monotone link functions. We validate our theoretical results and show the competitive performance of our method via extensive numerical experiments. Most importantly, we demonstrate the effectiveness of our approach in recovering highly interpretable causal DAGs over Sepsis Associated Derangements (SADs) while achieving comparable prediction performance to powerful ``black-box'' models such as XGBoost. Thus, the future adoption of our proposed method to conduct continuous surveillance of high-risk patients by clinicians is much more likely.  ( 2 min )
    Returning The Favour: When Regression Benefits From Probabilistic Causal Knowledge. (arXiv:2301.11214v1 [stat.ML])
    A directed acyclic graph (DAG) provides valuable prior knowledge that is often discarded in regression tasks in machine learning. We show that the independences arising from the presence of collider structures in DAGs provide meaningful inductive biases, which constrain the regression hypothesis space and improve predictive performance. We introduce collider regression, a framework to incorporate probabilistic causal knowledge from a collider in a regression problem. When the hypothesis space is a reproducing kernel Hilbert space, we prove a strictly positive generalisation benefit under mild assumptions and provide closed-form estimators of the empirical risk minimiser. Experiments on synthetic and climate model data demonstrate performance gains of the proposed methodology.  ( 2 min )
    A Graph Neural Network with Negative Message Passing for Graph Coloring. (arXiv:2301.11164v1 [cs.LG])
    Graph neural networks have received increased attention over the past years due to their promising ability to handle graph-structured data, which can be found in many real-world problems such as recommended systems and drug synthesis. Most existing research focuses on using graph neural networks to solve homophilous problems, but little attention has been paid to heterophily-type problems. In this paper, we propose a graph network model for graph coloring, which is a class of representative heterophilous problems. Different from the conventional graph networks, we introduce negative message passing into the proposed graph neural network for more effective information exchange in handling graph coloring problems. Moreover, a new loss function taking into account the self-information of the nodes is suggested to accelerate the learning process. Experimental studies are carried out to compare the proposed graph model with five state-of-the-art algorithms on ten publicly available graph coloring problems and one real-world application. Numerical results demonstrate the effectiveness of the proposed graph neural network.  ( 2 min )
    Deep Laplacian-based Options for Temporally-Extended Exploration. (arXiv:2301.11181v1 [cs.LG])
    Selecting exploratory actions that generate a rich stream of experience for better learning is a fundamental challenge in reinforcement learning (RL). An approach to tackle this problem consists in selecting actions according to specific policies for an extended period of time, also known as options. A recent line of work to derive such exploratory options builds upon the eigenfunctions of the graph Laplacian. Importantly, until now these methods have been mostly limited to tabular domains where (1) the graph Laplacian matrix was either given or could be fully estimated, (2) performing eigendecomposition on this matrix was computationally tractable, and (3) value functions could be learned exactly. Additionally, these methods required a separate option discovery phase. These assumptions are fundamentally not scalable. In this paper we address these limitations and show how recent results for directly approximating the eigenfunctions of the Laplacian can be leveraged to truly scale up options-based exploration. To do so, we introduce a fully online deep RL algorithm for discovering Laplacian-based options and evaluate our approach on a variety of pixel-based tasks. We compare to several state-of-the-art exploration methods and show that our approach is effective, general, and especially promising in non-stationary settings.  ( 2 min )
    Flex-Net: A Graph Neural Network Approach to Resource Management in Flexible Duplex Networks. (arXiv:2301.11166v1 [cs.NI])
    Flexible duplex networks allow users to dynamically employ uplink and downlink channels without static time scheduling, thereby utilizing the network resources efficiently. This work investigates the sum-rate maximization of flexible duplex networks. In particular, we consider a network with pairwise-fixed communication links. Corresponding combinatorial optimization is a non-deterministic polynomial (NP)-hard without a closed-form solution. In this respect, the existing heuristics entail high computational complexity, raising a scalability issue in large networks. Motivated by the recent success of Graph Neural Networks (GNNs) in solving NP-hard wireless resource management problems, we propose a novel GNN architecture, named Flex-Net, to jointly optimize the communication direction and transmission power. The proposed GNN produces near-optimal performance meanwhile maintaining a low computational complexity compared to the most commonly used techniques. Furthermore, our numerical results shed light on the advantages of using GNNs in terms of sample complexity, scalability, and generalization capability.  ( 2 min )
    Train Hard, Fight Easy: Robust Meta Reinforcement Learning. (arXiv:2301.11147v1 [cs.LG])
    A major challenge of reinforcement learning (RL) in real-world applications is the variation between environments, tasks or clients. Meta-RL (MRL) addresses this issue by learning a meta-policy that adapts to new tasks. Standard MRL methods optimize the average return over tasks, but often suffer from poor results in tasks of high risk or difficulty. This limits system reliability whenever test tasks are not known in advance. In this work, we propose a robust MRL objective with a controlled robustness level. Optimization of analogous robust objectives in RL often leads to both biased gradients and data inefficiency. We prove that the former disappears in MRL, and address the latter via the novel Robust Meta RL algorithm (RoML). RoML is a meta-algorithm that generates a robust version of any given MRL algorithm, by identifying and over-sampling harder tasks throughout training. We demonstrate that RoML learns substantially different meta-policies and achieves robust returns on several navigation and continuous control benchmarks.  ( 2 min )
    Which Experiences Are Influential for Your Agent? Policy Iteration with Turn-over Dropout. (arXiv:2301.11168v1 [cs.LG])
    In reinforcement learning (RL) with experience replay, experiences stored in a replay buffer influence the RL agent's performance. Information about the influence is valuable for various purposes, including experience cleansing and analysis. One method for estimating the influence of individual experiences is agent comparison, but it is prohibitively expensive when there is a large number of experiences. In this paper, we present PI+ToD as a method for efficiently estimating the influence of experiences. PI+ToD is a policy iteration that efficiently estimates the influence of experiences by utilizing turn-over dropout. We demonstrate the efficiency of PI+ToD with experiments in MuJoCo environments.  ( 2 min )
    Convolutional Learning on Simplicial Complexes. (arXiv:2301.11163v1 [cs.LG])
    We propose a simplicial complex convolutional neural network (SCCNN) to learn data representations on simplicial complexes. It performs convolutions based on the multi-hop simplicial adjacencies via common faces and cofaces independently and captures the inter-simplicial couplings, generalizing state-of-the-art. Upon studying symmetries of the simplicial domain and the data space, it is shown to be permutation and orientation equivariant, thus, incorporating such inductive biases. Based on the Hodge theory, we perform a spectral analysis to understand how SCCNNs regulate data in different frequencies, showing that the convolutions via faces and cofaces operate in two orthogonal data spaces. Lastly, we study the stability of SCCNNs to domain deformations and examine the effects of various factors. Empirical results show the benefits of higher-order convolutions and inter-simplicial couplings in simplex prediction and trajectory prediction.  ( 2 min )
    simple diffusion: End-to-end diffusion for high resolution images. (arXiv:2301.11093v1 [cs.CV])
    Currently, applying diffusion models in pixel space of high resolution images is difficult. Instead, existing approaches focus on diffusion in lower dimensional spaces (latent diffusion), or have multiple super-resolution levels of generation referred to as cascades. The downside is that these approaches add additional complexity to the diffusion framework. This paper aims to improve denoising diffusion for high resolution images while keeping the model as simple as possible. The paper is centered around the research question: How can one train a standard denoising diffusion models on high resolution images, and still obtain performance comparable to these alternate approaches? The four main findings are: 1) the noise schedule should be adjusted for high resolution images, 2) It is sufficient to scale only a particular part of the architecture, 3) dropout should be added at specific locations in the architecture, and 4) downsampling is an effective strategy to avoid high resolution feature maps. Combining these simple yet effective techniques, we achieve state-of-the-art on image generation among diffusion models without sampling modifiers on ImageNet.  ( 2 min )
    Federated Learning over Coupled Graphs. (arXiv:2301.11099v1 [cs.LG])
    Graphs are widely used to represent the relations among entities. When one owns the complete data, an entire graph can be easily built, therefore performing analysis on the graph is straightforward. However, in many scenarios, it is impractical to centralize the data due to data privacy concerns. An organization or party only keeps a part of the whole graph data, i.e., graph data is isolated from different parties. Recently, Federated Learning (FL) has been proposed to solve the data isolation issue, mainly for Euclidean data. It is still a challenge to apply FL on graph data because graphs contain topological information which is notorious for its non-IID nature and is hard to partition. In this work, we propose a novel FL framework for graph data, FedCog, to efficiently handle coupled graphs that are a kind of distributed graph data, but widely exist in a variety of real-world applications such as mobile carriers' communication networks and banks' transaction networks. We theoretically prove the correctness and security of FedCog. Experimental results demonstrate that our method FedCog significantly outperforms traditional FL methods on graphs. Remarkably, our FedCog improves the accuracy of node classification tasks by up to 14.7%.  ( 2 min )
    FedHQL: Federated Heterogeneous Q-Learning. (arXiv:2301.11135v1 [cs.LG])
    Federated Reinforcement Learning (FedRL) encourages distributed agents to learn collectively from each other's experience to improve their performance without exchanging their raw trajectories. The existing work on FedRL assumes that all participating agents are homogeneous, which requires all agents to share the same policy parameterization (e.g., network architectures and training configurations). However, in real-world applications, agents are often in disagreement about the architecture and the parameters, possibly also because of disparate computational budgets. Because homogeneity is not given in practice, we introduce the problem setting of Federated Reinforcement Learning with Heterogeneous And bLack-box agEnts (FedRL-HALE). We present the unique challenges this new setting poses and propose the Federated Heterogeneous Q-Learning (FedHQL) algorithm that principally addresses these challenges. We empirically demonstrate the efficacy of FedHQL in boosting the sample efficiency of heterogeneous agents with distinct policy parameterization using standard RL tasks.  ( 2 min )
    Learning from Multiple Independent Advisors in Multi-agent Reinforcement Learning. (arXiv:2301.11153v1 [cs.LG])
    Multi-agent reinforcement learning typically suffers from the problem of sample inefficiency, where learning suitable policies involves the use of many data samples. Learning from external demonstrators is a possible solution that mitigates this problem. However, most prior approaches in this area assume the presence of a single demonstrator. Leveraging multiple knowledge sources (i.e., advisors) with expertise in distinct aspects of the environment could substantially speed up learning in complex environments. This paper considers the problem of simultaneously learning from multiple independent advisors in multi-agent reinforcement learning. The approach leverages a two-level Q-learning architecture, and extends this framework from single-agent to multi-agent settings. We provide principled algorithms that incorporate a set of advisors by both evaluating the advisors at each state and subsequently using the advisors to guide action selection. We also provide theoretical convergence and sample complexity guarantees. Experimentally, we validate our approach in three different test-beds and show that our algorithms give better performances than baselines, can effectively integrate the combined expertise of different advisors, and learn to ignore bad advice.  ( 2 min )
    SQ Lower Bounds for Random Sparse Planted Vector Problem. (arXiv:2301.11124v1 [cs.LG])
    Consider the setting where a $\rho$-sparse Rademacher vector is planted in a random $d$-dimensional subspace of $R^n$. A classical question is how to recover this planted vector given a random basis in this subspace. A recent result by [ZSWB21] showed that the Lattice basis reduction algorithm can recover the planted vector when $n\geq d+1$. Although the algorithm is not expected to tolerate inverse polynomial amount of noise, it is surprising because it was previously shown that recovery cannot be achieved by low degree polynomials when $n\ll \rho^2 d^{2}$ [MW21]. A natural question is whether we can derive an Statistical Query (SQ) lower bound matching the previous low degree lower bound in [MW21]. This will - imply that the SQ lower bound can be surpassed by lattice based algorithms; - predict the computational hardness when the planted vector is perturbed by inverse polynomial amount of noise. In this paper, we prove such an SQ lower bound. In particular, we show that super-polynomial number of VSTAT queries is needed to solve the easier statistical testing problem when $n\ll \rho^2 d^{2}$ and $\rho\gg \frac{1}{\sqrt{d}}$. The most notable technique we used to derive the SQ lower bound is the almost equivalence relationship between SQ lower bound and low degree lower bound [BBH+20, MW21].  ( 2 min )
    Bayesian Detection of Mesoscale Structures in Pathway Data on Graphs. (arXiv:2301.11120v1 [stat.ME])
    Mesoscale structures are an integral part of the abstraction and analysis of complex systems. They reveal a node's function in the network, and facilitate our understanding of the network dynamics. For example, they can represent communities in social or citation networks, roles in corporate interactions, or core-periphery structures in transportation networks. We usually detect mesoscale structures under the assumption of independence of interactions. Still, in many cases, the interactions invalidate this assumption by occurring in a specific order. Such patterns emerge in pathway data; to capture them, we have to model the dependencies between interactions using higher-order network models. However, the detection of mesoscale structures in higher-order networks is still under-researched. In this work, we derive a Bayesian approach that simultaneously models the optimal partitioning of nodes in groups and the optimal higher-order network dynamics between the groups. In synthetic data we demonstrate that our method can recover both standard proximity-based communities and role-based groupings of nodes. In synthetic and real world data we show that it can compete with baseline techniques, while additionally providing interpretable abstractions of network dynamics.  ( 2 min )
    Minerva: A File-Based Ransomware Detector. (arXiv:2301.11050v1 [cs.CR])
    Ransomware is a rapidly evolving type of malware designed to encrypt user files on a device, making them inaccessible in order to exact a ransom. Ransomware attacks resulted in billions of dollars in damages in recent years and are expected to cause hundreds of billions more in the next decade. With current state-of-the-art process-based detectors being heavily susceptible to evasion attacks, no comprehensive solution to this problem is available today. This paper presents Minerva, a new approach to ransomware detection. Unlike current methods focused on identifying ransomware based on process-level behavioral modeling, Minerva detects ransomware by building behavioral profiles of files based on all the operations they receive in a time window. Minerva addresses some of the critical challenges associated with process-based approaches, specifically their vulnerability to complex evasion attacks. Our evaluation of Minerva demonstrates its effectiveness in detecting ransomware attacks, including those that are able to bypass existing defenses. Our results show that Minerva identifies ransomware activity with an average accuracy of 99.45% and an average recall of 99.66%, with 99.97% of ransomware detected within 1 second.  ( 2 min )
    Incomplete Multi-view Clustering via Prototype-based Imputation. (arXiv:2301.11045v1 [cs.LG])
    In this paper, we study how to achieve two characteristics highly-expected by incomplete multi-view clustering (IMvC). Namely, i) instance commonality refers to that within-cluster instances should share a common pattern, and ii) view versatility refers to that cross-view samples should own view-specific patterns. To this end, we design a novel dual-stream model which employs a dual attention layer and a dual contrastive learning loss to learn view-specific prototypes and model the sample-prototype relationship. When the view is missed, our model performs data recovery using the prototypes in the missing view and the sample-prototype relationship inherited from the observed view. Thanks to our dual-stream model, both cluster- and view-specific information could be captured, and thus the instance commonality and view versatility could be preserved to facilitate IMvC. Extensive experiments demonstrate the superiority of our method on six challenging benchmarks compared with 11 approaches. The code will be released.  ( 2 min )
    Random Grid Neural Processes for Parametric Partial Differential Equations. (arXiv:2301.11040v1 [cs.LG])
    We introduce a new class of spatially stochastic physics and data informed deep latent models for parametric partial differential equations (PDEs) which operate through scalable variational neural processes. We achieve this by assigning probability measures to the spatial domain, which allows us to treat collocation grids probabilistically as random variables to be marginalised out. Adapting this spatial statistics view, we solve forward and inverse problems for parametric PDEs in a way that leads to the construction of Gaussian process models of solution fields. The implementation of these random grids poses a unique set of challenges for inverse physics informed deep learning frameworks and we propose a new architecture called Grid Invariant Convolutional Networks (GICNets) to overcome these challenges. We further show how to incorporate noisy data in a principled manner into our physics informed model to improve predictions for problems where data may be available but whose measurement location does not coincide with any fixed mesh or grid. The proposed method is tested on a nonlinear Poisson problem, Burgers equation, and Navier-Stokes equations, and we provide extensive numerical comparisons. We demonstrate significant computational advantages over current physics informed neural learning methods for parametric PDEs while improving the predictive capabilities and flexibility of these models.  ( 2 min )
    Inspecting class hierarchies in classification-based metric learning models. (arXiv:2301.11065v1 [cs.LG])
    Most classification models treat all misclassifications equally. However, different classes may be related, and these hierarchical relationships must be considered in some classification problems. These problems can be addressed by using hierarchical information during training. Unfortunately, this information is not available for all datasets. Many classification-based metric learning methods use class representatives in embedding space to represent different classes. The relationships among the learned class representatives can then be used to estimate class hierarchical structures. If we have a predefined class hierarchy, the learned class representatives can be assessed to determine whether the metric learning model learned semantic distances that match our prior knowledge. In this work, we train a softmax classifier and three metric learning models with several training options on benchmark and real-world datasets. In addition to the standard classification accuracy, we evaluate the hierarchical inference performance by inspecting learned class representatives and the hierarchy-informed performance, i.e., the classification performance, and the metric learning performance by considering predefined hierarchical structures. Furthermore, we investigate how the considered measures are affected by various models and training options. When our proposed ProxyDR model is trained without using predefined hierarchical structures, the hierarchical inference performance is significantly better than that of the popular NormFace model. Additionally, our model enhances some hierarchy-informed performance measures under the same training options. We also found that convolutional neural networks (CNNs) with random weights correspond to the predefined hierarchies better than random chance.  ( 2 min )
    PerfSAGE: Generalized Inference Performance Predictor for Arbitrary Deep Learning Models on Edge Devices. (arXiv:2301.10999v1 [cs.LG])
    The ability to accurately predict deep neural network (DNN) inference performance metrics, such as latency, power, and memory footprint, for an arbitrary DNN on a target hardware platform is essential to the design of DNN based models. This ability is critical for the (manual or automatic) design, optimization, and deployment of practical DNNs for a specific hardware deployment platform. Unfortunately, these metrics are slow to evaluate using simulators (where available) and typically require measurement on the target hardware. This work describes PerfSAGE, a novel graph neural network (GNN) that predicts inference latency, energy, and memory footprint on an arbitrary DNN TFlite graph (TFL, 2017). In contrast, previously published performance predictors can only predict latency and are restricted to pre-defined construction rules or search spaces. This paper also describes the EdgeDLPerf dataset of 134,912 DNNs randomly sampled from four task search spaces and annotated with inference performance metrics from three edge hardware platforms. Using this dataset, we train PerfSAGE and provide experimental results that demonstrate state-of-the-art prediction accuracy with a Mean Absolute Percentage Error of <5% across all targets and model search spaces. These results: (1) Outperform previous state-of-art GNN-based predictors (Dudziak et al., 2020), (2) Accurately predict performance on accelerators (a shortfall of non-GNN-based predictors (Zhang et al., 2021)), and (3) Demonstrate predictions on arbitrary input graphs without modifications to the feature extractor.  ( 2 min )
    WL meet VC. (arXiv:2301.11039v1 [cs.LG])
    Recently, many works studied the expressive power of graph neural networks (GNNs) by linking it to the $1$-dimensional Weisfeiler--Leman algorithm ($1\text{-}\mathsf{WL}$). Here, the $1\text{-}\mathsf{WL}$ is a well-studied heuristic for the graph isomorphism problem, which iteratively colors or partitions a graph's vertex set. While this connection has led to significant advances in understanding and enhancing GNNs' expressive power, it does not provide insights into their generalization performance, i.e., their ability to make meaningful predictions beyond the training set. In this paper, we study GNNs' generalization ability through the lens of Vapnik--Chervonenkis (VC) dimension theory in two settings, focusing on graph-level predictions. First, when no upper bound on the graphs' order is known, we show that the bitlength of GNNs' weights tightly bounds their VC dimension. Further, we derive an upper bound for GNNs' VC dimension using the number of colors produced by the $1\text{-}\mathsf{WL}$. Secondly, when an upper bound on the graphs' order is known, we show a tight connection between the number of graphs distinguishable by the $1\text{-}\mathsf{WL}$ and GNNs' VC dimension. Our empirical study confirms the validity of our theoretical findings.  ( 2 min )
    Multi-Agent congestion cost minimization with linear function approximation. (arXiv:2301.10993v1 [cs.LG])
    This work considers multiple agents traversing a network from a source node to the goal node. The cost to an agent for traveling a link has a private as well as a congestion component. The agent's objective is to find a path to the goal node with minimum overall cost in a decentralized way. We model this as a fully decentralized multi-agent reinforcement learning problem and propose a novel multi-agent congestion cost minimization (MACCM) algorithm. Our MACCM algorithm uses linear function approximations of transition probabilities and the global cost function. In the absence of a central controller and to preserve privacy, agents communicate the cost function parameters to their neighbors via a time-varying communication network. Moreover, each agent maintains its estimate of the global state-action value, which is updated via a multi-agent extended value iteration (MAEVI) sub-routine. We show that our MACCM algorithm achieves a sub-linear regret. The proof requires the convergence of cost function parameters, the MAEVI algorithm, and analysis of the regret bounds induced by the MAEVI triggering condition for each agent. We implement our algorithm on a two node network with multiple links to validate it. We first identify the optimal policy, the optimal number of agents going to the goal node in each period. We observe that the average regret is close to zero for 2 and 3 agents. The optimal policy captures the trade-off between the minimum cost of staying at a node and the congestion cost of going to the goal node. Our work is a generalization of learning the stochastic shortest path problem.  ( 2 min )
    Time-sensitive Learning for Heterogeneous Federated Edge Intelligence. (arXiv:2301.10977v1 [cs.LG])
    Real-time machine learning has recently attracted significant interest due to its potential to support instantaneous learning, adaptation, and decision making in a wide range of application domains, including self-driving vehicles, intelligent transportation, and industry automation. We investigate real-time ML in a federated edge intelligence (FEI) system, an edge computing system that implements federated learning (FL) solutions based on data samples collected and uploaded from decentralized data networks. FEI systems often exhibit heterogenous communication and computational resource distribution, as well as non-i.i.d. data samples, resulting in long model training time and inefficient resource utilization. Motivated by this fact, we propose a time-sensitive federated learning (TS-FL) framework to minimize the overall run-time for collaboratively training a shared ML model. Training acceleration solutions for both TS-FL with synchronous coordination (TS-FL-SC) and asynchronous coordination (TS-FL-ASC) are investigated. To address straggler effect in TS-FL-SC, we develop an analytical solution to characterize the impact of selecting different subsets of edge servers on the overall model training time. A server dropping-based solution is proposed to allow slow-performance edge servers to be removed from participating in model training if their impact on the resulting model accuracy is limited. A joint optimization algorithm is proposed to minimize the overall time consumption of model training by selecting participating edge servers, local epoch number. We develop an analytical expression to characterize the impact of staleness effect of asynchronous coordination and straggler effect of FL on the time consumption of TS-FL-ASC. Experimental results show that TS-FL-SC and TS-FL-ASC can provide up to 63% and 28% of reduction, in the overall model training time, respectively.  ( 2 min )
    A Fully First-Order Method for Stochastic Bilevel Optimization. (arXiv:2301.10945v1 [math.OC])
    We consider stochastic unconstrained bilevel optimization problems when only the first-order gradient oracles are available. While numerous optimization methods have been proposed for tackling bilevel problems, existing methods either tend to require possibly expensive calculations regarding Hessians of lower-level objectives, or lack rigorous finite-time performance guarantees. In this work, we propose a Fully First-order Stochastic Approximation (F2SA) method, and study its non-asymptotic convergence properties. Specifically, we show that F2SA converges to an $\epsilon$-stationary solution of the bilevel problem after $\epsilon^{-7/2}, \epsilon^{-5/2}$, and $\epsilon^{-3/2}$ iterations (each iteration using $O(1)$ samples) when stochastic noises are in both level objectives, only in the upper-level objective, and not present (deterministic settings), respectively. We further show that if we employ momentum-assisted gradient estimators, the iteration complexities can be improved to $\epsilon^{-5/2}, \epsilon^{-4/2}$, and $\epsilon^{-3/2}$, respectively. We demonstrate even superior practical performance of the proposed method over existing second-order based approaches on MNIST data-hypercleaning experiments.  ( 2 min )
    Graph Neural Networks can Recover the Hidden Features Solely from the Graph Structure. (arXiv:2301.10956v1 [cs.LG])
    Graph Neural Networks (GNNs) are popular models for graph learning problems. GNNs show strong empirical performance in many practical tasks. However, the theoretical properties have not been completely elucidated. In this paper, we investigate whether GNNs can exploit the graph structure from the perspective of the expressive power of GNNs. In our analysis, we consider graph generation processes that are controlled by hidden node features, which contain all information about the graph structure. A typical example of this framework is kNN graphs constructed from the hidden features. In our main results, we show that GNNs can recover the hidden node features from the input graph alone, even when all node features, including the hidden features themselves and any indirect hints, are unavailable. GNNs can further use the recovered node features for downstream tasks. These results show that GNNs can fully exploit the graph structure by themselves, and in effect, GNNs can use both the hidden and explicit node features for downstream tasks. In the experiments, we confirm the validity of our results by showing that GNNs can accurately recover the hidden features using a GNN architecture built based on our theoretical analysis.  ( 2 min )
    Visiting Distant Neighbors in Graph Convolutional Networks. (arXiv:2301.10960v1 [cs.LG])
    We extend the graph convolutional network method for deep learning on graph data to higher order in terms of neighboring nodes. In order to construct representations for a node in a graph, in addition to the features of the node and its immediate neighboring nodes, we also include more distant nodes in the calculations. In experimenting with a number of publicly available citation graph datasets, we show that this higher order neighbor visiting pays off by outperforming the original model especially when we have a limited number of available labeled data points for the training of the model.  ( 2 min )
    Privacy-Preserving Joint Edge Association and Power Optimization for the Internet of Vehicles via Federated Multi-Agent Reinforcement Learning. (arXiv:2301.11014v1 [cs.LG])
    Proactive edge association is capable of improving wireless connectivity at the cost of increased handover (HO) frequency and energy consumption, while relying on a large amount of private information sharing required for decision making. In order to improve the connectivity-cost trade-off without privacy leakage, we investigate the privacy-preserving joint edge association and power allocation (JEAPA) problem in the face of the environmental uncertainty and the infeasibility of individual learning. Upon modelling the problem by a decentralized partially observable Markov Decision Process (Dec-POMDP), it is solved by federated multi-agent reinforcement learning (FMARL) through only sharing encrypted training data for federatively learning the policy sought. Our simulation results show that the proposed solution strikes a compelling trade-off, while preserving a higher privacy level than the state-of-the-art solutions.  ( 2 min )
    Neural Dynamic Focused Topic Model. (arXiv:2301.10988v1 [cs.CL])
    Topic models and all their variants analyse text by learning meaningful representations through word co-occurrences. As pointed out by Williamson et al. (2010), such models implicitly assume that the probability of a topic to be active and its proportion within each document are positively correlated. This correlation can be strongly detrimental in the case of documents created over time, simply because recent documents are likely better described by new and hence rare topics. In this work we leverage recent advances in neural variational inference and present an alternative neural approach to the dynamic Focused Topic Model. Indeed, we develop a neural model for topic evolution which exploits sequences of Bernoulli random variables in order to track the appearances of topics, thereby decoupling their activities from their proportions. We evaluate our model on three different datasets (the UN general debates, the collection of NeurIPS papers, and the ACL Anthology dataset) and show that it (i) outperforms state-of-the-art topic models in generalization tasks and (ii) performs comparably to them on prediction tasks, while employing roughly the same number of parameters, and converging about two times faster. Source code to reproduce our experiments is available online.  ( 2 min )
    On the Importance of Noise Scheduling for Diffusion Models. (arXiv:2301.10972v1 [cs.CV])
    We empirically study the effect of noise scheduling strategies for denoising diffusion generative models. There are three findings: (1) the noise scheduling is crucial for the performance, and the optimal one depends on the task (e.g., image sizes), (2) when increasing the image size, the optimal noise scheduling shifts towards a noisier one (due to increased redundancy in pixels), and (3) simply scaling the input data by a factor of $b$ while keeping the noise schedule function fixed (equivalent to shifting the logSNR by $\log b$) is a good strategy across image sizes. This simple recipe, when combined with recently proposed Recurrent Interface Network (RIN), yields state-of-the-art pixel-based diffusion models for high-resolution images on ImageNet, enabling single-stage, end-to-end generation of diverse and high-fidelity images at 1024$\times$1024 resolution for the first time (without upsampling/cascades).  ( 2 min )
    Learning Large Scale Sparse Models. (arXiv:2301.10958v1 [stat.ML])
    In this work, we consider learning sparse models in large scale settings, where the number of samples and the feature dimension can grow as large as millions or billions. Two immediate issues occur under such challenging scenario: (i) computational cost; (ii) memory overhead. In particular, the memory issue precludes a large volume of prior algorithms that are based on batch optimization technique. To remedy the problem, we propose to learn sparse models such as Lasso in an online manner where in each iteration, only one randomly chosen sample is revealed to update a sparse iterate. Thereby, the memory cost is independent of the sample size and gradient evaluation for one sample is efficient. Perhaps amazingly, we find that with the same parameter, sparsity promoted by batch methods is not preserved in online fashion. We analyze such interesting phenomenon and illustrate some effective variants including mini-batch methods and a hard thresholding based stochastic gradient algorithm. Extensive experiments are carried out on a public dataset which supports our findings and algorithms.  ( 2 min )
    SparDA: Accelerating Dynamic Sparse Deep Neural Networks via Sparse-Dense Transformation. (arXiv:2301.10936v1 [cs.LG])
    Due to its high cost-effectiveness, sparsity has become the most important approach for building efficient deep-learning models. However, commodity accelerators are built mainly for efficient dense computation, creating a huge gap for general sparse computation to leverage. Existing solutions have to use time-consuming compiling to improve the efficiency of sparse kernels in an ahead-of-time manner and thus are limited to static sparsity. A wide range of dynamic sparsity opportunities is missed because their sparsity patterns are only known at runtime. This limits the future of building more biological brain-like neural networks that should be dynamically and sparsely activated. In this paper, we bridge the gap between sparse computation and commodity accelerators by proposing a system, called Spider, for efficiently executing deep learning models with dynamic sparsity. We identify an important property called permutation invariant that applies to most deep-learning computations. The property enables Spider (1) to extract dynamic sparsity patterns of tensors that are only known at runtime with little overhead; and (2) to transform the dynamic sparse computation into an equivalent dense computation which has been extremely optimized on commodity accelerators. Extensive evaluation on diverse models shows Spider can extract and transform dynamic sparsity with negligible overhead but brings up to 9.4x speedup over state-of-art solutions.  ( 2 min )
    Affective Faces for Goal-Driven Dyadic Communication. (arXiv:2301.10939v1 [cs.CV])
    We introduce a video framework for modeling the association between verbal and non-verbal communication during dyadic conversation. Given the input speech of a speaker, our approach retrieves a video of a listener, who has facial expressions that would be socially appropriate given the context. Our approach further allows the listener to be conditioned on their own goals, personalities, or backgrounds. Our approach models conversations through a composition of large language models and vision-language models, creating internal representations that are interpretable and controllable. To study multimodal communication, we propose a new video dataset of unscripted conversations covering diverse topics and demographics. Experiments and visualizations show our approach is able to output listeners that are significantly more socially appropriate than baselines. However, many challenges remain, and we release our dataset publicly to spur further progress. See our website for video results, data, and code: https://realtalk.cs.columbia.edu.  ( 2 min )
    Super-Resolution Analysis via Machine Learning: A Survey for Fluid Flows. (arXiv:2301.10937v1 [physics.flu-dyn])
    This paper surveys machine-learning-based super-resolution reconstruction for vortical flows. Super resolution aims to find the high-resolution flow fields from low-resolution data and is generally an approach used in image reconstruction. In addition to surveying a variety of recent super-resolution applications, we provide case studies of super-resolution analysis for an example of two-dimensional decaying isotropic turbulence. We demonstrate that physics-inspired model designs enable successful reconstruction of vortical flows from spatially limited measurements. We also discuss the challenges and outlooks of machine-learning-based super-resolution analysis for fluid flow applications. The insights gained from this study can be leveraged for super-resolution analysis of numerical and experimental flow data.  ( 2 min )
    On the Global Convergence of Risk-Averse Policy Gradient Methods with Dynamic Time-Consistent Risk Measures. (arXiv:2301.10932v1 [cs.LG])
    Risk-sensitive reinforcement learning (RL) has become a popular tool to control the risk of uncertain outcomes and ensure reliable performance in various sequential decision-making problems. While policy gradient methods have been developed for risk-sensitive RL, it remains unclear if these methods enjoy the same global convergence guarantees as in the risk-neutral case. In this paper, we consider a class of dynamic time-consistent risk measures, called Expected Conditional Risk Measures (ECRMs), and derive policy gradient updates for ECRM-based objective functions. Under both constrained direct parameterization and unconstrained softmax parameterization, we provide global convergence of the corresponding risk-averse policy gradient algorithms. We further test a risk-averse variant of REINFORCE algorithm on a stochastic Cliffwalk environment to demonstrate the efficacy of our algorithm and the importance of risk control.  ( 2 min )
    Efficient Trust Region-Based Safe Reinforcement Learning with Low-Bias Distributional Actor-Critic. (arXiv:2301.10923v1 [cs.LG])
    To apply reinforcement learning (RL) to real-world applications, agents are required to adhere to the safety guidelines of their respective domains. Safe RL can effectively handle the guidelines by converting them into constraints of the RL problem. In this paper, we develop a safe distributional RL method based on the trust region method, which can satisfy constraints consistently. However, policies may not meet the safety guidelines due to the estimation bias of distributional critics, and importance sampling required for the trust region method can hinder performance due to its significant variance. Hence, we enhance safety performance through the following approaches. First, we train distributional critics to have low estimation biases using proposed target distributions where bias-variance can be traded off. Second, we propose novel surrogates for the trust region method expressed with Q-functions using the reparameterization trick. Additionally, depending on initial policy settings, there can be no policy satisfying constraints within a trust region. To handle this infeasible issue, we propose a gradient integration method which guarantees to find a policy satisfying all constraints from an unsafe initial policy. From extensive experiments, the proposed method with risk-averse constraints shows minimal constraint violations while achieving high returns compared to existing safe RL methods.  ( 2 min )
    Partial advantage estimator for proximal policy optimization. (arXiv:2301.10920v1 [cs.LG])
    Estimation of value in policy gradient methods is a fundamental problem. Generalized Advantage Estimation (GAE) is an exponentially-weighted estimator of an advantage function similar to $\lambda$-return. It substantially reduces the variance of policy gradient estimates at the expense of bias. In practical applications, a truncated GAE is used due to the incompleteness of the trajectory, which results in a large bias during estimation. To address this challenge, instead of using the entire truncated GAE, we propose to take a part of it when calculating updates, which significantly reduces the bias resulting from the incomplete trajectory. We perform experiments in MuJoCo and $\mu$RTS to investigate the effect of different partial coefficient and sampling lengths. We show that our partial GAE approach yields better empirical results in both environments.  ( 2 min )
    SuperFed: Weight Shared Federated Learning. (arXiv:2301.10879v1 [cs.LG])
    Federated Learning (FL) is a well-established technique for privacy preserving distributed training. Much attention has been given to various aspects of FL training. A growing number of applications that consume FL-trained models, however, increasingly operate under dynamically and unpredictably variable conditions, rendering a single model insufficient. We argue for training a global family of models cost efficiently in a federated fashion. Training them independently for different tradeoff points incurs $O(k)$ cost for any k architectures of interest, however. Straightforward applications of FL techniques to recent weight-shared training approaches is either infeasible or prohibitively expensive. We propose SuperFed - an architectural framework that incurs $O(1)$ cost to co-train a large family of models in a federated fashion by leveraging weight-shared learning. We achieve an order of magnitude cost savings on both communication and computation by proposing two novel training mechanisms: (a) distribution of weight-shared models to federated clients, (b) central aggregation of arbitrarily overlapping weight-shared model parameters. The combination of these mechanisms is shown to reach an order of magnitude (9.43x) reduction in computation and communication cost for training a $5*10^{18}$-sized family of models, compared to independently training as few as $k = 9$ DNNs without any accuracy loss.  ( 2 min )
    Unsupervised Protein-Ligand Binding Energy Prediction via Neural Euler's Rotation Equation. (arXiv:2301.10814v1 [q-bio.BM])
    Protein-ligand binding prediction is a fundamental problem in AI-driven drug discovery. Prior work focused on supervised learning methods using a large set of binding affinity data for small molecules, but it is hard to apply the same strategy to other drug classes like antibodies as labelled data is limited. In this paper, we explore unsupervised approaches and reformulate binding energy prediction as a generative modeling task. Specifically, we train an energy-based model on a set of unlabelled protein-ligand complexes using SE(3) denoising score matching and interpret its log-likelihood as binding affinity. Our key contribution is a new equivariant rotation prediction network called Neural Euler's Rotation Equations (NERE) for SE(3) score matching. It predicts a rotation by modeling the force and torque between protein and ligand atoms, where the force is defined as the gradient of an energy function with respect to atom coordinates. We evaluate NERE on protein-ligand and antibody-antigen binding affinity prediction benchmarks. Our model outperforms all unsupervised baselines (physics-based and statistical potentials) and matches supervised learning methods in the antibody case.  ( 2 min )
    Joint action loss for proximal policy optimization. (arXiv:2301.10919v1 [cs.LG])
    PPO (Proximal Policy Optimization) is a state-of-the-art policy gradient algorithm that has been successfully applied to complex computer games such as Dota 2 and Honor of Kings. In these environments, an agent makes compound actions consisting of multiple sub-actions. PPO uses clipping to restrict policy updates. Although clipping is simple and effective, it is not efficient in its sample use. For compound actions, most PPO implementations consider the joint probability (density) of sub-actions, which means that if the ratio of a sample (state compound-action pair) exceeds the range, the gradient the sample produces is zero. Instead, for each sub-action we calculate the loss separately, which is less prone to clipping during updates thereby making better use of samples. Further, we propose a multi-action mixed loss that combines joint and separate probabilities. We perform experiments in Gym-$\mu$RTS and MuJoCo. Our hybrid model improves performance by more than 50\% in different MuJoCo environments compared to OpenAI's PPO benchmark results. And in Gym-$\mu$RTS, we find the sub-action loss outperforms the standard PPO approach, especially when the clip range is large. Our findings suggest this method can better balance the use-efficiency and quality of samples.  ( 2 min )
    Distilling Cognitive Backdoor Patterns within an Image. (arXiv:2301.10908v1 [cs.LG])
    This paper proposes a simple method to distill and detect backdoor patterns within an image: \emph{Cognitive Distillation} (CD). The idea is to extract the "minimal essence" from an input image responsible for the model's prediction. CD optimizes an input mask to extract a small pattern from the input image that can lead to the same model output (i.e., logits or deep features). The extracted pattern can help understand the cognitive mechanism of a model on clean vs. backdoor images and is thus called a \emph{Cognitive Pattern} (CP). Using CD and the distilled CPs, we uncover an interesting phenomenon of backdoor attacks: despite the various forms and sizes of trigger patterns used by different attacks, the CPs of backdoor samples are all surprisingly and suspiciously small. One thus can leverage the learned mask to detect and remove backdoor examples from poisoned training datasets. We conduct extensive experiments to show that CD can robustly detect a wide range of advanced backdoor attacks. We also show that CD can potentially be applied to help detect potential biases from face datasets. Code is available at \url{https://github.com/HanxunH/CognitiveDistillation}.  ( 2 min )
    Experimenting with an Evaluation Framework for Imbalanced Data Learning (EFIDL). (arXiv:2301.10888v1 [cs.LG])
    Introduction Data imbalance is one of the crucial issues in big data analysis with fewer labels. For example, in real-world healthcare data, spam detection labels, and financial fraud detection datasets. Many data balance methods were introduced to improve machine learning algorithms' performance. Research claims SMOTE and SMOTE-based data-augmentation (generate new data points) methods could improve algorithm performance. However, we found in many online tutorials, the valuation methods were applied based on synthesized datasets that introduced bias into the evaluation, and the performance got a false improvement. In this study, we proposed, a new evaluation framework for imbalanced data learning methods. We have experimented on five data balance methods and whether the performance of algorithms will improve or not. Methods We collected 8 imbalanced healthcare datasets with different imbalanced rates from different domains. Applied 6 data augmentation methods with 11 machine learning methods testing if the data augmentation will help with improving machine learning performance. We compared the traditional data augmentation evaluation methods with our proposed cross-validation evaluation framework Results Using traditional data augmentation evaluation meta hods will give a false impression of improving the performance. However, our proposed evaluation method shows data augmentation has limited ability to improve the results. Conclusion EFIDL is more suitable for evaluating the prediction performance of an ML method when data are augmented. Using an unsuitable evaluation framework will give false results. Future researchers should consider the evaluation framework we proposed when dealing with augmented datasets. Our experiments showed data augmentation does not help improve ML prediction performance.  ( 2 min )
    GPU-based Private Information Retrieval for On-Device Machine Learning Inference. (arXiv:2301.10904v1 [cs.CR])
    On-device machine learning (ML) inference can enable the use of private user data on user devices without remote servers. However, a pure on-device solution to private ML inference is impractical for many applications that rely on embedding tables that are too large to be stored on-device. To overcome this barrier, we propose the use of private information retrieval (PIR) to efficiently and privately retrieve embeddings from servers without sharing any private information during on-device ML inference. As off-the-shelf PIR algorithms are usually too computationally intensive to directly use for latency-sensitive inference tasks, we 1) develop a novel algorithm for accelerating PIR on GPUs, and 2) co-design PIR with the downstream ML application to obtain further speedup. Our GPU acceleration strategy improves system throughput by more than $20 \times$ over an optimized CPU PIR implementation, and our co-design techniques obtain over $5 \times$ additional throughput improvement at fixed model quality. Together, on various on-device ML applications such as recommendation and language modeling, our system on a single V100 GPU can serve up to $100,000$ queries per second -- a $>100 \times$ throughput improvement over a naively implemented system -- while maintaining model accuracy, and limiting inference communication and response latency to within $300$KB and $<100$ms respectively.  ( 2 min )
    Learning Gradients of Convex Functions with Monotone Gradient Networks. (arXiv:2301.10862v1 [cs.LG])
    While much effort has been devoted to deriving and studying effective convex formulations of signal processing problems, the gradients of convex functions also have critical applications ranging from gradient-based optimization to optimal transport. Recent works have explored data-driven methods for learning convex objectives, but learning their monotone gradients is seldom studied. In this work, we propose Cascaded and Modular Monotone Gradient Networks (C-MGN and M-MGN respectively), two monotone gradient neural network architectures for directly learning the gradients of convex functions. We show that our networks are simpler to train, learn monotone gradient fields more accurately, and use significantly fewer parameters than state of the art methods. We further demonstrate their ability to learn optimal transport mappings to augment driving image data.  ( 2 min )
    Reef-insight: A framework for reef habitat mapping with clustering methods via remote sensing. (arXiv:2301.10876v1 [cs.LG])
    Environmental damage has been of much concern, particularly coastal areas and the oceans given climate change and drastic effects of pollution and extreme climate events. Our present day analytical capabilities along with the advancements in information acquisition techniques such as remote sensing can be utilized for the management and study of coral reef ecosystems. In this paper, we present Reef-insight, an unsupervised machine learning framework that features advanced clustering methods and remote sensing for reef community mapping. Our framework compares different clustering methods to evaluate them for reef community mapping using remote sensing data. We evaluate four major clustering approaches such as k- means, hierarchical clustering, Gaussian mixture model, and density-based clustering based on qualitative and visual assessment. We utilise remote sensing data featuring Heron reef island region in the Great Barrier Reef of Australia. Our results indicate that clustering methods using remote sensing data can well identify benthic and geomorphic clusters that are found in reefs when compared to other studies. Our results indicate that Reef-insight can generate detailed reef community maps outlining distinct reef habitats and has the potential to enable further insights for reef restoration projects. We release our framework as open source software to enable its extension to different parts of the world  ( 2 min )
    Partial Mobilization: Tracking Multilingual Information Flows Amongst Russian Media Outlets and Telegram. (arXiv:2301.10856v1 [cs.CY])
    In response to disinformation and propaganda from Russian online media following the Russian invasion of Ukraine, Russian outlets including Russia Today and Sputnik News were banned throughout Europe. Many of these Russian outlets, in order to reach their audiences, began to heavily promote their content on messaging services like Telegram. In this work, to understand this phenomenon, we study how 16 Russian media outlets have interacted with and utilized 732 Telegram channels throughout 2022. To do this, we utilize a multilingual version of the foundational model MPNet to embed articles and Telegram messages in a shared embedding space and semantically compare content. Leveraging a parallelized version of DP-Means clustering, we perform paragraph-level topic/narrative extraction and time-series analysis with Hawkes Processes. With this approach, across our websites, we find between 2.3% (ura.news) and 26.7% (ukraina.ru) of their content originated/resulted from activity on Telegram. Finally, tracking the spread of individual narratives, we measure the rate at which these websites and channels disseminate content within the Russian media ecosystem.  ( 2 min )
    Automatic Intrinsic Reward Shaping for Exploration in Deep Reinforcement Learning. (arXiv:2301.10886v1 [cs.LG])
    We present AIRS: Automatic Intrinsic Reward Shaping that intelligently and adaptively provides high-quality intrinsic rewards to enhance exploration in reinforcement learning (RL). More specifically, AIRS selects shaping function from a predefined set based on the estimated task return in real-time, providing reliable exploration incentives and alleviating the biased objective problem. Moreover, we develop an intrinsic reward toolkit to provide efficient and reliable implementations of diverse intrinsic reward approaches. We test AIRS on various tasks of Procgen games and DeepMind Control Suite. Extensive simulation demonstrates that AIRS can outperform the benchmarking schemes and achieve superior performance with simple architecture.  ( 2 min )
    When Layers Play the Lottery, all Tickets Win at Initialization. (arXiv:2301.10835v1 [cs.LG])
    Pruning is a standard technique for reducing the computational cost of deep networks. Many advances in pruning leverage concepts from the Lottery Ticket Hypothesis (LTH). LTH reveals that inside a trained dense network exists sparse subnetworks (tickets) able to achieve similar accuracy (i.e., win the lottery - winning tickets). Pruning at initialization focuses on finding winning tickets without training a dense network. Studies on these concepts share the trend that subnetworks come from weight or filter pruning. In this work, we investigate LTH and pruning at initialization from the lens of layer pruning. First, we confirm the existence of winning tickets when the pruning process removes layers. Leveraged by this observation, we propose to discover these winning tickets at initialization, eliminating the requirement of heavy computational resources for training the initial (over-parameterized) dense network. Extensive experiments show that our winning tickets notably speed up the training phase and reduce up to 51% of carbon emission, an important step towards democratization and green Artificial Intelligence. Beyond computational benefits, our winning tickets exhibit robustness against adversarial and out-of-distribution examples. Finally, we show that our subnetworks easily win the lottery at initialization while tickets from filter removal (the standard structured LTH) hardly become winning tickets.  ( 2 min )
    RobustPdM: Designing Robust Predictive Maintenance against Adversarial Attacks. (arXiv:2301.10822v1 [cs.CR])
    The state-of-the-art predictive maintenance (PdM) techniques have shown great success in reducing maintenance costs and downtime of complicated machines while increasing overall productivity through extensive utilization of Internet-of-Things (IoT) and Deep Learning (DL). Unfortunately, IoT sensors and DL algorithms are both prone to cyber-attacks. For instance, DL algorithms are known for their susceptibility to adversarial examples. Such adversarial attacks are vastly under-explored in the PdM domain. This is because the adversarial attacks in the computer vision domain for classification tasks cannot be directly applied to the PdM domain for multivariate time series (MTS) regression tasks. In this work, we propose an end-to-end methodology to design adversarially robust PdM systems by extensively analyzing the effect of different types of adversarial attacks and proposing a novel adversarial defense technique for DL-enabled PdM models. First, we propose novel MTS Projected Gradient Descent (PGD) and MTS PGD with random restarts (PGD_r) attacks. Then, we evaluate the impact of MTS PGD and PGD_r along with MTS Fast Gradient Sign Method (FGSM) and MTS Basic Iterative Method (BIM) on Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Convolutional Neural Network (CNN), and Bi-directional LSTM based PdM system. Our results using NASA's turbofan engine dataset show that adversarial attacks can cause a severe defect (up to 11X) in the RUL prediction, outperforming the effectiveness of the state-of-the-art PdM attacks by 3X. Furthermore, we present a novel approximate adversarial training method to defend against adversarial attacks. We observe that approximate adversarial training can significantly improve the robustness of PdM models (up to 54X) and outperforms the state-of-the-art PdM defense methods by offering 3X more robustness.  ( 2 min )
    Salesforce CausalAI Library: A Fast and Scalable Framework for Causal Analysis of Time Series and Tabular Data. (arXiv:2301.10859v1 [cs.LG])
    We introduce the Salesforce CausalAI Library, an open-source library for causal analysis using observational data. It supports causal discovery and causal inference for tabular and time series data, of both discrete and continuous types. This library includes algorithms that handle linear and non-linear causal relationships between variables, and uses multi-processing for speed-up. We also include a data generator capable of generating synthetic data with specified structural equation model for both the aforementioned data formats and types, that helps users control the ground-truth causal process while investigating various algorithms. Finally, we provide a user interface (UI) that allows users to perform causal analysis on data without coding. The goal of this library is to provide a fast and flexible solution for a variety of problems in the domain of causality. This technical report describes the Salesforce CausalAI API along with its capabilities, the implementations of the supported algorithms, and experiments demonstrating their performance and speed. Our library is available at \url{https://github.com/salesforce/causalai}.  ( 2 min )
    Improved Bitcoin Price Prediction based on COVID-19 data. (arXiv:2301.10840v1 [cs.LG])
    Social turbulence can affect people financial decisions, causing changes in spending and saving. During a global turbulence as significant as the COVID-19 pandemic, such changes are inevitable. Here we examine how the effects of COVID-19 on various jurisdictions influenced the global price of Bitcoin. We hypothesize that lock downs and expectations of economic recession erode people trust in fiat (government-issued) currencies, thus elevating cryptocurrencies. Hence, we expect to identify a causal relation between the turbulence caused by the pandemic, demand for Bitcoin, and ultimately its price. To test the hypothesis, we merged datasets of Bitcoin prices and COVID-19 cases and deaths. We also engineered extra features and applied statistical and machine learning (ML) models. We applied a Random Forest model (RF) to identify and rank the feature importance, and ran a Long Short-Term Memory (LSTM) model on Bitcoin prices data set twice: with and without accounting for COVID-19 related features. We find that adding COVID-19 data into the LSTM model improved prediction of Bitcoin prices.  ( 2 min )
    On the inconsistency of separable losses for structured prediction. (arXiv:2301.10810v1 [cs.LG])
    In this paper, we prove that separable negative log-likelihood losses for structured prediction are not necessarily Bayes consistent, or, in other words, minimizing these losses may not result in a model that predicts the most probable structure in the data distribution for a given input. This fact opens the question of whether these losses are well-adapted for structured prediction and, if so, why.  ( 2 min )
    Unravelling physics beyond the standard model with classical and quantum anomaly detection. (arXiv:2301.10787v1 [hep-ex])
    Much hope for finding new physics phenomena at microscopic scale relies on the observations obtained from High Energy Physics experiments, like the ones performed at the Large Hadron Collider (LHC). However, current experiments do not indicate clear signs of new physics that could guide the development of additional Beyond Standard Model (BSM) theories. Identifying signatures of new physics out of the enormous amount of data produced at the LHC falls into the class of anomaly detection and constitutes one of the greatest computational challenges. In this article, we propose a novel strategy to perform anomaly detection in a supervised learning setting, based on the artificial creation of anomalies through a random process. For the resulting supervised learning problem, we successfully apply classical and quantum Support Vector Classifiers (CSVC and QSVC respectively) to identify the artificial anomalies among the SM events. Even more promising, we find that employing an SVC trained to identify the artificial anomalies, it is possible to identify realistic BSM events with high accuracy. In parallel, we also explore the potential of quantum algorithms for improving the classification accuracy and provide plausible conditions for the best exploitation of this novel computational paradigm.  ( 2 min )
    Increasing Fairness in Compromise on Accuracy via Weighted Vote with Learning Guarantees. (arXiv:2301.10813v1 [cs.LG])
    As the bias issue is being taken more and more seriously in widely applied machine learning systems, the decrease in accuracy in most cases deeply disturbs researchers when increasing fairness. To address this problem, we present a novel analysis of the expected fairness quality via weighted vote, suitable for both binary and multi-class classification. The analysis takes the correction of biased predictions by ensemble members into account and provides learning bounds that are amenable to efficient minimisation. We further propose a pruning method based on this analysis and the concepts of domination and Pareto optimality, which is able to increase fairness under a prerequisite of little or even no accuracy decline. The experimental results indicate that the proposed learning bounds are faithful and that the proposed pruning method can indeed increase ensemble fairness without much accuracy degradation.  ( 2 min )
    Gene-SGAN: a method for discovering disease subtypes with imaging and genetic signatures via multi-view weakly-supervised deep clustering. (arXiv:2301.10772v1 [q-bio.QM])
    Disease heterogeneity has been a critical challenge for precision diagnosis and treatment, especially in neurologic and neuropsychiatric diseases. Many diseases can display multiple distinct brain phenotypes across individuals, potentially reflecting disease subtypes that can be captured using MRI and machine learning methods. However, biological interpretability and treatment relevance are limited if the derived subtypes are not associated with genetic drivers or susceptibility factors. Herein, we describe Gene-SGAN - a multi-view, weakly-supervised deep clustering method - which dissects disease heterogeneity by jointly considering phenotypic and genetic data, thereby conferring genetic correlations to the disease subtypes and associated endophenotypic signatures. We first validate the generalizability, interpretability, and robustness of Gene-SGAN in semi-synthetic experiments. We then demonstrate its application to real multi-site datasets from 28,858 individuals, deriving subtypes of Alzheimer's disease and brain endophenotypes associated with hypertension, from MRI and SNP data. Derived brain phenotypes displayed significant differences in neuroanatomical patterns, genetic determinants, biological and clinical biomarkers, indicating potentially distinct underlying neuropathologic processes, genetic drivers, and susceptibility factors. Overall, Gene-SGAN is broadly applicable to disease subtyping and endophenotype discovery, and is herein tested on disease-related, genetically-driven neuroimaging phenotypes.  ( 2 min )
    Quantum anomaly detection in the latent space of proton collision events at the LHC. (arXiv:2301.10780v1 [quant-ph])
    We propose a new strategy for anomaly detection at the LHC based on unsupervised quantum machine learning algorithms. To accommodate the constraints on the problem size dictated by the limitations of current quantum hardware we develop a classical convolutional autoencoder. The designed quantum anomaly detection models, namely an unsupervised kernel machine and two clustering algorithms, are trained to find new-physics events in the latent representation of LHC data produced by the autoencoder. The performance of the quantum algorithms is benchmarked against classical counterparts on different new-physics scenarios and its dependence on the dimensionality of the latent space and the size of the training dataset is studied. For kernel-based anomaly detection, we identify a regime where the quantum model significantly outperforms its classical counterpart. An instance of the kernel machine is implemented on a quantum computer to verify its suitability for available hardware. We demonstrate that the observed consistent performance advantage is related to the inherent quantum properties of the circuit used.  ( 2 min )
    Evaluating Probabilistic Classifiers: The Triptych. (arXiv:2301.10803v1 [stat.ME])
    Probability forecasts for binary outcomes, often referred to as probabilistic classifiers or confidence scores, are ubiquitous in science and society, and methods for evaluating and comparing them are in great demand. We propose and study a triptych of diagnostic graphics that focus on distinct and complementary aspects of forecast performance: The reliability diagram addresses calibration, the receiver operating characteristic (ROC) curve diagnoses discrimination ability, and the Murphy diagram visualizes overall predictive performance and value. A Murphy curve shows a forecast's mean elementary scores, including the widely used misclassification rate, and the area under a Murphy curve equals the mean Brier score. For a calibrated forecast, the reliability curve lies on the diagonal, and for competing calibrated forecasts, the ROC and Murphy curves share the same number of crossing points. We invoke the recently developed CORP (Consistent, Optimally binned, Reproducible, and Pool-Adjacent-Violators (PAV) algorithm based) approach to craft reliability diagrams and decompose a mean score into miscalibration (MCB), discrimination (DSC), and uncertainty (UNC) components. Plots of the DSC measure of discrimination ability versus the calibration metric MCB visualize classifier performance across multiple competitors. The proposed tools are illustrated in empirical examples from astrophysics, economics, and social science.  ( 2 min )
    Graph Neural Tangent Kernel: Convergence on Large Graphs. (arXiv:2301.10808v1 [cs.LG])
    Graph neural networks (GNNs) achieve remarkable performance in graph machine learning tasks but can be hard to train on large-graph data, where their learning dynamics are not well understood. We investigate the training dynamics of large-graph GNNs using graph neural tangent kernels (GNTKs) and graphons. In the limit of large width, optimization of an overparametrized NN is equivalent to kernel regression on the NTK. Here, we investigate how the GNTK evolves as another independent dimension is varied: the graph size. We use graphons to define limit objects -- graphon NNs for GNNs, and graphon NTKs for GNTKs, and prove that, on a sequence of growing graphs, the GNTKs converge to the graphon NTK. We further prove that the eigenspaces of the GNTK, which are related to the problem learning directions and associated learning speeds, converge to the spectrum of the GNTK. This implies that in the large-graph limit, the GNTK fitted on a graph of moderate size can be used to solve the same task on the large-graph and infer the learning dynamics of the large-graph GNN. These results are verified empirically on node regression and node classification tasks.  ( 2 min )
    Generative Tertiary Structure-based RNA Design. (arXiv:2301.10774v1 [q-bio.BM])
    Learning from 3D biological macromolecules with artificial intelligence technologies has been an emerging area. Computational protein design, known as the inverse of protein structure prediction, aims to generate protein sequences that will fold into the defined structure. Analogous to protein design, RNA design is also an important topic in synthetic biology, which aims to generate RNA sequences by given structures. However, existing RNA design methods mainly focus on the secondary structure, ignoring the informative tertiary structure, which is commonly used in protein design. To explore the complex coupling between RNA sequence and 3D structure, we introduce an RNA tertiary structure modeling method to efficiently capture useful information from the 3D structure of RNA. For a fair comparison, we collect abundant RNA data and split the data according to tertiary structures. With the standard dataset, we conduct a benchmark by employing structure-based protein design approaches with our RNA tertiary structure modeling method. We believe our work will stimulate the future development of tertiary structure-based RNA design and bridge the gap between the RNA 3D structures and sequences.  ( 2 min )
  • Open

    Flowification: Everything is a Normalizing Flow. (arXiv:2205.15209v3 [cs.LG] UPDATED)
    The two key characteristics of a normalizing flow is that it is invertible (in particular, dimension preserving) and that it monitors the amount by which it changes the likelihood of data points as samples are propagated along the network. Recently, multiple generalizations of normalizing flows have been introduced that relax these two conditions. On the other hand, neural networks only perform a forward pass on the input, there is neither a notion of an inverse of a neural network nor is there one of its likelihood contribution. In this paper we argue that certain neural network architectures can be enriched with a stochastic inverse pass and that their likelihood contribution can be monitored in a way that they fall under the generalized notion of a normalizing flow mentioned above. We term this enrichment flowification. We prove that neural networks only containing linear layers, convolutional layers and invertible activations such as LeakyReLU can be flowified and evaluate them in the generative setting on image datasets.  ( 2 min )
    Smoothed Online Learning for Prediction in Piecewise Affine Systems. (arXiv:2301.11187v1 [stat.ML])
    The problem of piecewise affine (PWA) regression and planning is of foundational importance to the study of online learning, control, and robotics, where it provides a theoretically and empirically tractable setting to study systems undergoing sharp changes in the dynamics. Unfortunately, due to the discontinuities that arise when crossing into different ``pieces,'' learning in general sequential settings is impossible and practical algorithms are forced to resort to heuristic approaches. This paper builds on the recently developed smoothed online learning framework and provides the first algorithms for prediction and simulation in PWA systems whose regret is polynomial in all relevant problem parameters under a weak smoothness assumption; moreover, our algorithms are efficient in the number of calls to an optimization oracle. We further apply our results to the problems of one-step prediction and multi-step simulation regret in piecewise affine dynamical systems, where the learner is tasked with simulating trajectories and regret is measured in terms of the Wasserstein distance between simulated and true data. Along the way, we develop several technical tools of more general interest.  ( 2 min )
    Two-step interpretable modeling of Intensive Care Acquired Infections. (arXiv:2301.11146v1 [stat.AP])
    We present a novel methodology for integrating high resolution longitudinal data with the dynamic prediction capabilities of survival models. The aim is two-fold: to improve the predictive power while maintaining interpretability of the models. To go beyond the black box paradigm of artificial neural networks, we propose a parsimonious and robust semi-parametric approach (i.e., a landmarking competing risks model) that combines routinely collected low-resolution data with predictive features extracted from a convolutional neural network, that was trained on high resolution time-dependent information. We then use saliency maps to analyze and explain the extra predictive power of this model. To illustrate our methodology, we focus on healthcare-associated infections in patients admitted to an intensive care unit.  ( 2 min )
    Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons. (arXiv:2301.11270v1 [cs.LG])
    We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). Our analysis shows that when the true reward function is linear, the widely used maximum likelihood estimator (MLE) converges under both the Bradley-Terry-Luce (BTL) model and the Plackett-Luce (PL) model. However, we show that when training a policy based on the learned reward model, MLE fails while a pessimistic MLE provides policies with improved performance under certain coverage assumptions. Additionally, we demonstrate that under the PL model, the true MLE and an alternative MLE that splits the $K$-wise comparison into pairwise comparisons both converge. Moreover, the true MLE is asymptotically more efficient. Our results validate the empirical success of existing RLHF algorithms in InstructGPT and provide new insights for algorithm design. Furthermore, our results unify the problem of RLHF and Max Entropy Inverse Reinforcement Learning, and provide the first sample complexity bound for both problems.  ( 2 min )
    WL meet VC. (arXiv:2301.11039v1 [cs.LG])
    Recently, many works studied the expressive power of graph neural networks (GNNs) by linking it to the $1$-dimensional Weisfeiler--Leman algorithm ($1\text{-}\mathsf{WL}$). Here, the $1\text{-}\mathsf{WL}$ is a well-studied heuristic for the graph isomorphism problem, which iteratively colors or partitions a graph's vertex set. While this connection has led to significant advances in understanding and enhancing GNNs' expressive power, it does not provide insights into their generalization performance, i.e., their ability to make meaningful predictions beyond the training set. In this paper, we study GNNs' generalization ability through the lens of Vapnik--Chervonenkis (VC) dimension theory in two settings, focusing on graph-level predictions. First, when no upper bound on the graphs' order is known, we show that the bitlength of GNNs' weights tightly bounds their VC dimension. Further, we derive an upper bound for GNNs' VC dimension using the number of colors produced by the $1\text{-}\mathsf{WL}$. Secondly, when an upper bound on the graphs' order is known, we show a tight connection between the number of graphs distinguishable by the $1\text{-}\mathsf{WL}$ and GNNs' VC dimension. Our empirical study confirms the validity of our theoretical findings.  ( 2 min )
    Extending Adversarial Attacks to Produce Adversarial Class Probability Distributions. (arXiv:2004.06383v3 [cs.LG] UPDATED)
    Despite the remarkable performance and generalization levels of deep learning models in a wide range of artificial intelligence tasks, it has been demonstrated that these models can be easily fooled by the addition of imperceptible yet malicious perturbations to natural inputs. These altered inputs are known in the literature as adversarial examples. In this paper, we propose a novel probabilistic framework to generalize and extend adversarial attacks in order to produce a desired probability distribution for the classes when we apply the attack method to a large number of inputs. This novel attack paradigm provides the adversary with greater control over the target model, thereby exposing, in a wide range of scenarios, threats against deep learning models that cannot be conducted by the conventional paradigms. We introduce four different strategies to efficiently generate such attacks, and illustrate our approach by extending multiple adversarial attack algorithms. We also experimentally validate our approach for the spoken command classification task and the Tweet emotion classification task, two exemplary machine learning problems in the audio and text domain, respectively. Our results demonstrate that we can closely approximate any probability distribution for the classes while maintaining a high fooling rate and even prevent the attacks from being detected by label-shift detection methods.  ( 2 min )
    Maximum Optimality Margin: A Unified Approach for Contextual Linear Programming and Inverse Linear Programming. (arXiv:2301.11260v1 [cs.LG])
    In this paper, we study the predict-then-optimize problem where the output of a machine learning prediction task is used as the input of some downstream optimization problem, say, the objective coefficient vector of a linear program. The problem is also known as predictive analytics or contextual linear programming. The existing approaches largely suffer from either (i) optimization intractability (a non-convex objective function)/statistical inefficiency (a suboptimal generalization bound) or (ii) requiring strong condition(s) such as no constraint or loss calibration. We develop a new approach to the problem called \textit{maximum optimality margin} which designs the machine learning loss function by the optimality condition of the downstream optimization. The max-margin formulation enjoys both computational efficiency and good theoretical properties for the learning procedure. More importantly, our new approach only needs the observations of the optimal solution in the training data rather than the objective function, which makes it a new and natural approach to the inverse linear programming problem under both contextual and context-free settings; we also analyze the proposed method under both offline and online settings, and demonstrate its performance using numerical experiments.  ( 2 min )
    Inspecting class hierarchies in classification-based metric learning models. (arXiv:2301.11065v1 [cs.LG])
    Most classification models treat all misclassifications equally. However, different classes may be related, and these hierarchical relationships must be considered in some classification problems. These problems can be addressed by using hierarchical information during training. Unfortunately, this information is not available for all datasets. Many classification-based metric learning methods use class representatives in embedding space to represent different classes. The relationships among the learned class representatives can then be used to estimate class hierarchical structures. If we have a predefined class hierarchy, the learned class representatives can be assessed to determine whether the metric learning model learned semantic distances that match our prior knowledge. In this work, we train a softmax classifier and three metric learning models with several training options on benchmark and real-world datasets. In addition to the standard classification accuracy, we evaluate the hierarchical inference performance by inspecting learned class representatives and the hierarchy-informed performance, i.e., the classification performance, and the metric learning performance by considering predefined hierarchical structures. Furthermore, we investigate how the considered measures are affected by various models and training options. When our proposed ProxyDR model is trained without using predefined hierarchical structures, the hierarchical inference performance is significantly better than that of the popular NormFace model. Additionally, our model enhances some hierarchy-informed performance measures under the same training options. We also found that convolutional neural networks (CNNs) with random weights correspond to the predefined hierarchies better than random chance.  ( 2 min )
    Uncertain Evidence in Probabilistic Models and Stochastic Simulators. (arXiv:2210.12236v2 [stat.ML] UPDATED)
    We consider the problem of performing Bayesian inference in probabilistic models where observations are accompanied by uncertainty, referred to as "uncertain evidence." We explore how to interpret uncertain evidence, and by extension the importance of proper interpretation as it pertains to inference about latent variables. We consider a recently-proposed method "distributional evidence" as well as revisit two older methods: Jeffrey's rule and virtual evidence. We devise guidelines on how to account for uncertain evidence and we provide new insights, particularly regarding consistency. To showcase the impact of different interpretations of the same uncertain evidence, we carry out experiments in which one interpretation is defined as "correct." We then compare inference results from each different interpretation illustrating the importance of careful consideration of uncertain evidence.  ( 2 min )
    Your diffusion model secretly knows the dimension of the data manifold. (arXiv:2212.12611v2 [cs.LG] UPDATED)
    In this work, we propose a novel framework for estimating the dimension of the data manifold using a trained diffusion model. A diffusion model approximates the score function i.e. the gradient of the log density of a noise-corrupted version of the target distribution for varying levels of corruption. If the data concentrates around a manifold embedded in the high-dimensional ambient space, then as the level of corruption decreases, the score function points towards the manifold, as this direction becomes the direction of maximal likelihood increase. Therefore, for small levels of corruption, the diffusion model provides us with access to an approximation of the normal bundle of the data manifold. This allows us to estimate the dimension of the tangent space, thus, the intrinsic dimension of the data manifold. To the best of our knowledge, our method is the first deep-learning based estimator of the data manifold dimension and it outperforms well established statistical estimators in controlled experiments on both Euclidean and image data.  ( 2 min )
    Neural Continuous-Discrete State Space Models for Irregularly-Sampled Time Series. (arXiv:2301.11308v1 [cs.LG])
    Learning accurate predictive models of real-world dynamic phenomena (e.g., climate, biological) remains a challenging task. One key issue is that the data generated by both natural and artificial processes often comprise time series that are irregularly sampled and/or contain missing observations. In this work, we propose the Neural Continuous-Discrete State Space Model (NCDSSM) for continuous-time modeling of time series through discrete-time observations. NCDSSM employs auxiliary variables to disentangle recognition from dynamics, thus requiring amortized inference only for the auxiliary variables. Leveraging techniques from continuous-discrete filtering theory, we demonstrate how to perform accurate Bayesian inference for the dynamic states. We propose three flexible parameterizations of the latent dynamics and an efficient training objective that marginalizes the dynamic states during inference. Empirical results on multiple benchmark datasets across various domains show improved imputation and forecasting performance of NCDSSM over existing models.  ( 2 min )
    Causal Inference with Hidden Mediators. (arXiv:2111.02927v2 [math.ST] UPDATED)
    Proximal causal inference was recently proposed as a framework to identify causal effects from observational data in the presence of hidden confounders for which proxies are available. In this paper, we extend the proximal causal inference approach to settings where identification of causal effects hinges upon a set of mediators which are not observed, yet error prone proxies of the hidden mediators are measured. Specifically, (i) We establish causal hidden mediation analysis, which extends classical causal mediation analysis methods for identifying natural direct and indirect effects under no unmeasured confounding to a setting where the mediator of interest is hidden, but proxies of it are available. (ii) We establish hidden front-door criterion, which extends the classical front-door criterion to allow for hidden mediators for which proxies are available. (iii) We show that the identification of a certain causal effect called population intervention indirect effect remains possible with hidden mediators in settings where challenges in (i) and (ii) might co-exist. We view (i)-(iii) as important steps towards the practical application of front-door criteria and mediation analysis as mediators are almost always measured with error and thus, the most one can hope for in practice is that the measurements are at best proxies of mediating mechanisms. We propose identification approaches for the parameters of interest in our considered models. For the estimation aspect, we propose an influence function-based estimation method and provide an analysis for the robustness of the estimators.  ( 2 min )
    Incorporating Prior Knowledge into Neural Networks through an Implicit Composite Kernel. (arXiv:2205.07384v5 [cs.LG] UPDATED)
    It is challenging to guide neural network (NN) learning with prior knowledge. In contrast, many known properties, such as spatial smoothness or seasonality, are straightforward to model by choosing an appropriate kernel in a Gaussian process (GP). Many deep learning applications could be enhanced by modeling such known properties. For example, convolutional neural networks (CNNs) are frequently used in remote sensing, which is subject to strong seasonal effects. We propose to blend the strengths of deep learning and the clear modeling capabilities of GPs by using a composite kernel that combines a kernel implicitly defined by a neural network with a second kernel function chosen to model known properties (e.g., seasonality). We implement this idea by combining a deep network and an efficient mapping based on the Nystrom approximation, which we call Implicit Composite Kernel (ICK). We then adopt a sample-then-optimize approach to approximate the full GP posterior distribution. We demonstrate that ICK has superior performance and flexibility on both synthetic and real-world data sets. We believe that ICK framework can be used to include prior information into neural networks in many applications.  ( 2 min )
    On the Dissipation of Ideal Hamiltonian Monte Carlo Sampler. (arXiv:2209.07438v2 [stat.CO] UPDATED)
    We report on what seems to be an intriguing connection between variable integration time and partial velocity refreshment of Ideal Hamiltonian Monte Carlo samplers, both of which can be used for reducing the dissipative behavior of the dynamics. More concretely, we show that on quadratic potentials, efficiency can be improved through these means by a $\sqrt{\kappa}$ factor in Wasserstein-2 distance, compared to classical constant integration time, fully refreshed HMC. We additionally explore the benefit of randomized integrators for simulating the Hamiltonian dynamics under higher order regularity conditions.  ( 2 min )
    Conformal Prediction for Trustworthy Detection of Railway Signals. (arXiv:2301.11136v1 [stat.ML])
    We present an application of conformal prediction, a form of uncertainty quantification with guarantees, to the detection of railway signals. State-of-the-art architectures are tested and the most promising one undergoes the process of conformalization, where a correction is applied to the predicted bounding boxes (i.e. to their height and width) such that they comply with a predefined probability of success. We work with a novel exploratory dataset of images taken from the perspective of a train operator, as a first step to build and validate future trustworthy machine learning models for the detection of railway signals.  ( 2 min )
    Scale-Free Adversarial Multi-Armed Bandit with Arbitrary Feedback Delays. (arXiv:2110.13400v3 [cs.LG] UPDATED)
    We consider the Scale-Free Adversarial Multi-Armed Bandit (MAB) problem with unrestricted feedback delays. In contrast to the standard assumption that all losses are $[0,1]$-bounded, in our setting, losses can fall in a general bounded interval $[-L, L]$, unknown to the agent beforehand. Furthermore, the feedback of each arm pull can experience arbitrary delays. We propose a novel approach named Scale-Free Delayed INF (SFD-INF) for this novel setting, which combines a recent "convex combination trick" together with a novel doubling and skipping technique. We then present two instances of SFD-INF, each with carefully designed delay-adapted learning scales. The first one SFD-TINF uses $\frac 12$-Tsallis entropy regularizer and can achieve $\widetilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret when the losses are non-negative, where $K$ is the number of actions, $T$ is the number of steps, and $D$ is the total feedback delay. This bound nearly matches the $\Omega((\sqrt{KT}+\sqrt{D\log K})L)$ lower-bound when regarding $K$ as a constant independent of $T$. The second one, SFD-LBINF, works for general scale-free losses and achieves a small-loss style adaptive regret bound $\widetilde{\mathcal O}(\sqrt{K\mathbb{E}[\tilde{\mathfrak L}_T^2]}+\sqrt{KDL})$, which falls to the $\widetilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret in the worst case and is thus more general than SFD-TINF despite a more complicated analysis and several extra logarithmic dependencies. Moreover, both instances also outperform the existing algorithms for non-delayed (i.e., $D=0$) scale-free adversarial MAB problems, which can be of independent interest.  ( 2 min )
    Evaluating Probabilistic Classifiers: The Triptych. (arXiv:2301.10803v1 [stat.ME])
    Probability forecasts for binary outcomes, often referred to as probabilistic classifiers or confidence scores, are ubiquitous in science and society, and methods for evaluating and comparing them are in great demand. We propose and study a triptych of diagnostic graphics that focus on distinct and complementary aspects of forecast performance: The reliability diagram addresses calibration, the receiver operating characteristic (ROC) curve diagnoses discrimination ability, and the Murphy diagram visualizes overall predictive performance and value. A Murphy curve shows a forecast's mean elementary scores, including the widely used misclassification rate, and the area under a Murphy curve equals the mean Brier score. For a calibrated forecast, the reliability curve lies on the diagonal, and for competing calibrated forecasts, the ROC and Murphy curves share the same number of crossing points. We invoke the recently developed CORP (Consistent, Optimally binned, Reproducible, and Pool-Adjacent-Violators (PAV) algorithm based) approach to craft reliability diagrams and decompose a mean score into miscalibration (MCB), discrimination (DSC), and uncertainty (UNC) components. Plots of the DSC measure of discrimination ability versus the calibration metric MCB visualize classifier performance across multiple competitors. The proposed tools are illustrated in empirical examples from astrophysics, economics, and social science.  ( 2 min )
    Banker Online Mirror Descent. (arXiv:2106.08943v2 [cs.LG] UPDATED)
    We propose Banker-OMD, a novel framework generalizing the classical Online Mirror Descent (OMD) technique in online learning algorithm design. Banker-OMD allows algorithms to robustly handle delayed feedback, and offers a general methodology for achieving $\tilde{O}(\sqrt{T} + \sqrt{D})$-style regret bounds in various delayed-feedback online learning tasks, where $T$ is the time horizon length and $D$ is the total feedback delay. We demonstrate the power of Banker-OMD with applications to three important bandit scenarios with delayed feedback, including delayed adversarial Multi-armed bandits (MAB), delayed adversarial linear bandits, and a novel delayed best-of-both-worlds MAB setting. Banker-OMD achieves nearly-optimal performance in all the three settings. In particular, it leads to the first delayed adversarial linear bandit algorithm achieving $\tilde{O}(\text{poly}(n)(\sqrt{T} + \sqrt{D}))$ regret.  ( 2 min )
    Proximal Causal Learning of Heterogeneous Treatment Effects. (arXiv:2301.10913v1 [stat.ML])
    Efficiently and flexibly estimating treatment effect heterogeneity is an important task in a wide variety of settings ranging from medicine to marketing, and there are a considerable number of promising conditional average treatment effect estimators currently available. These, however, typically rely on the assumption that the measured covariates are enough to justify conditional exchangeability. We propose the P-learner, motivated by the R-learner, a tailored two-stage loss function for learning heterogeneous treatment effects in settings where exchangeability given observed covariates is an implausible assumption, and we wish to rely on proxy variables for causal inference. Our proposed estimator can be implemented by off-the-shelf loss-minimizing machine learning methods, which in the case of kernel regression satisfies an oracle bound on the estimated error as long as the nuisance components are estimated reasonably well.  ( 2 min )
    KSD Aggregated Goodness-of-fit Test. (arXiv:2202.00824v5 [stat.ML] UPDATED)
    We investigate properties of goodness-of-fit tests based on the Kernel Stein Discrepancy (KSD). We introduce a strategy to construct a test, called KSDAgg, which aggregates multiple tests with different kernels. KSDAgg avoids splitting the data to perform kernel selection (which leads to a loss in test power), and rather maximises the test power over a collection of kernels. We provide non-asymptotic guarantees on the power of KSDAgg: we show it achieves the smallest uniform separation rate of the collection, up to a logarithmic term. For compactly supported densities with bounded model score function, we derive the rate for KSDAgg over restricted Sobolev balls; this rate corresponds to the minimax optimal rate over unrestricted Sobolev balls, up to an iterated logarithmic term. KSDAgg can be computed exactly in practice as it relies either on a parametric bootstrap or on a wild bootstrap to estimate the quantiles and the level corrections. In particular, for the crucial choice of bandwidth of a fixed kernel, it avoids resorting to arbitrary heuristics (such as median or standard deviation) or to data splitting. We find on both synthetic and real-world data that KSDAgg outperforms other state-of-the-art quadratic-time adaptive KSD-based goodness-of-fit testing procedures.  ( 2 min )
    Coin Sampling: Gradient-Based Bayesian Inference without Learning Rates. (arXiv:2301.11294v1 [stat.ML])
    In recent years, particle-based variational inference (ParVI) methods such as Stein variational gradient descent (SVGD) have grown in popularity as scalable methods for Bayesian inference. Unfortunately, the properties of such methods invariably depend on hyperparameters such as the learning rate, which must be carefully tuned by the practitioner in order to ensure convergence to the target measure at a suitable rate. In this paper, we introduce a suite of new particle-based methods for scalable Bayesian inference based on coin betting, which are entirely learning-rate free. We illustrate the performance of our approach on a range of numerical examples, including several high-dimensional models and datasets, demonstrating comparable performance to other ParVI algorithms.  ( 2 min )
    Re-embedding data to strengthen recovery guarantees of clustering. (arXiv:2301.10901v1 [cs.LG])
    We propose a clustering method that involves chaining four known techniques into a pipeline yielding an algorithm with stronger recovery guarantees than any of the four components separately. Given $n$ points in $\mathbb R^d$, the first component of our pipeline, which we call leapfrog distances, is reminiscent of density-based clustering, yielding an $n\times n$ distance matrix. The leapfrog distances are then translated to new embeddings using multidimensional scaling and spectral methods, two other known techniques, yielding new embeddings of the $n$ points in $\mathbb R^{d'}$, where $d'$ satisfies $d'\ll d$ in general. Finally, sum-of-norms (SON) clustering is applied to the re-embedded points. Although the fourth step (SON clustering) can in principle be replaced by any other clustering method, our focus is on provable guarantees of recovery of underlying structure. Therefore, we establish that the re-embedding improves recovery SON clustering, since SON clustering is a well-studied method that already has provable guarantees.  ( 2 min )
    Granger Causal Chain Discovery for Sepsis-Associated Derangements via Multivariate Hawkes Processes. (arXiv:2209.04480v4 [stat.AP] UPDATED)
    Modern health care systems are conducting continuous, automated surveillance of the electronic medical record (EMR) to identify adverse events with increasing frequency; however, many events such as sepsis do not have elucidated prodromes (i.e., event chains) that can be used to identify and intercept the adverse event early in its course. Currently, there does not exist reliable framework for discovering or describing causal chains that precede adverse hospital events. Clinically relevant and interpretable results require a framework that can (1) infer temporal interactions across multiple patient features found in EMR data (e.g., labs, vital signs, etc.) and (2) can identify patterns that precede and are specific to an impending adverse event (e.g., sepsis). In this work, we propose a linear multivariate Hawkes process model, coupled with ReLU link function, to recover a Granger Causal (GC) graph with both exciting and inhibiting effects. We develop a scalable two-phase gradient-based method to maximize a surrogate-likelihood and estimate the problem parameters, which is shown to be effective via extensive numerical simulation. Our method is subsequently extended to a data set of patients admitted to an academic level 1 trauma center located in Atalanta, GA, where the estimated GC graph identifies several highly interpretable chains that precede sepsis. Here, we demonstrate the effectiveness of our approach in learning a GC graph over Sepsis Associated Derangements (SADs), but it can be generalized to other applications with similar requirements.  ( 2 min )
    Efficient Aggregated Kernel Tests using Incomplete $U$-statistics. (arXiv:2206.09194v3 [stat.ML] UPDATED)
    We propose a series of computationally efficient nonparametric tests for the two-sample, independence, and goodness-of-fit problems, using the Maximum Mean Discrepancy (MMD), Hilbert Schmidt Independence Criterion (HSIC), and Kernel Stein Discrepancy (KSD), respectively. Our test statistics are incomplete $U$-statistics, with a computational cost that interpolates between linear time in the number of samples, and quadratic time, as associated with classical $U$-statistic tests. The three proposed tests aggregate over several kernel bandwidths to detect departures from the null on various scales: we call the resulting tests MMDAggInc, HSICAggInc and KSDAggInc. This procedure provides a solution to the fundamental kernel selection problem as we can aggregate a large number of kernels with several bandwidths without incurring a significant loss of test power. For the test thresholds, we derive a quantile bound for wild bootstrapped incomplete $U$-statistics, which is of independent interest. We derive non-asymptotic uniform separation rates for MMDAggInc and HSICAggInc, and quantify exactly the trade-off between computational efficiency and the attainable rates: this result is novel for tests based on incomplete $U$-statistics, to our knowledge. We further show that in the quadratic-time case, the wild bootstrap incurs no penalty to test power over the more widespread permutation-based approach, since both attain the same minimax optimal rates (which in turn match the rates that use oracle quantiles). We support our claims with numerical experiments on the trade-off between computational efficiency and test power. In all three testing frameworks, the linear-time versions of our proposed tests perform at least as well as the current linear-time state-of-the-art tests.  ( 2 min )
    Bayesian Detection of Mesoscale Structures in Pathway Data on Graphs. (arXiv:2301.11120v1 [stat.ME])
    Mesoscale structures are an integral part of the abstraction and analysis of complex systems. They reveal a node's function in the network, and facilitate our understanding of the network dynamics. For example, they can represent communities in social or citation networks, roles in corporate interactions, or core-periphery structures in transportation networks. We usually detect mesoscale structures under the assumption of independence of interactions. Still, in many cases, the interactions invalidate this assumption by occurring in a specific order. Such patterns emerge in pathway data; to capture them, we have to model the dependencies between interactions using higher-order network models. However, the detection of mesoscale structures in higher-order networks is still under-researched. In this work, we derive a Bayesian approach that simultaneously models the optimal partitioning of nodes in groups and the optimal higher-order network dynamics between the groups. In synthetic data we demonstrate that our method can recover both standard proximity-based communities and role-based groupings of nodes. In synthetic and real world data we show that it can compete with baseline techniques, while additionally providing interpretable abstractions of network dynamics.  ( 2 min )
    Minimax estimation of discontinuous optimal transport maps: The semi-discrete case. (arXiv:2301.11302v1 [math.ST])
    We consider the problem of estimating the optimal transport map between two probability distributions, $P$ and $Q$ in $\mathbb R^d$, on the basis of i.i.d. samples. All existing statistical analyses of this problem require the assumption that the transport map is Lipschitz, a strong requirement that, in particular, excludes any examples where the transport map is discontinuous. As a first step towards developing estimation procedures for discontinuous maps, we consider the important special case where the data distribution $Q$ is a discrete measure supported on a finite number of points in $\mathbb R^d$. We study a computationally efficient estimator initially proposed by Pooladian and Niles-Weed (2021), based on entropic optimal transport, and show in the semi-discrete setting that it converges at the minimax-optimal rate $n^{-1/2}$, independent of dimension. Other standard map estimation techniques both lack finite-sample guarantees in this setting and provably suffer from the curse of dimensionality. We confirm these results in numerical experiments, and provide experiments for other settings, not covered by our theory, which indicate that the entropic estimator is a promising methodology for other discontinuous transport map estimation problems.  ( 2 min )
    Returning The Favour: When Regression Benefits From Probabilistic Causal Knowledge. (arXiv:2301.11214v1 [stat.ML])
    A directed acyclic graph (DAG) provides valuable prior knowledge that is often discarded in regression tasks in machine learning. We show that the independences arising from the presence of collider structures in DAGs provide meaningful inductive biases, which constrain the regression hypothesis space and improve predictive performance. We introduce collider regression, a framework to incorporate probabilistic causal knowledge from a collider in a regression problem. When the hypothesis space is a reproducing kernel Hilbert space, we prove a strictly positive generalisation benefit under mild assumptions and provide closed-form estimators of the empirical risk minimiser. Experiments on synthetic and climate model data demonstrate performance gains of the proposed methodology.  ( 2 min )
    Learning from Mistakes: Self-Regularizing Hierarchical Semantic Representations in Point Cloud Segmentation. (arXiv:2301.11145v1 [cs.CV])
    Recent advances in autonomous robotic technologies have highlighted the growing need for precise environmental analysis. LiDAR semantic segmentation has gained attention to accomplish fine-grained scene understanding by acting directly on raw content provided by sensors. Recent solutions showed how different learning techniques can be used to improve the performance of the model, without any architectural or dataset change. Following this trend, we present a coarse-to-fine setup that LEArns from classification mistaKes (LEAK) derived from a standard model. First, classes are clustered into macro groups according to mutual prediction errors; then, the learning process is regularized by: (1) aligning class-conditional prototypical feature representation for both fine and coarse classes, (2) weighting instances with a per-class fairness index. Our LEAK approach is very general and can be seamlessly applied on top of any segmentation architecture; indeed, experimental results showed that it enables state-of-the-art performances on different architectures, datasets and tasks, while ensuring more balanced class-wise results and faster convergence.  ( 2 min )
    Learning Large Scale Sparse Models. (arXiv:2301.10958v1 [stat.ML])
    In this work, we consider learning sparse models in large scale settings, where the number of samples and the feature dimension can grow as large as millions or billions. Two immediate issues occur under such challenging scenario: (i) computational cost; (ii) memory overhead. In particular, the memory issue precludes a large volume of prior algorithms that are based on batch optimization technique. To remedy the problem, we propose to learn sparse models such as Lasso in an online manner where in each iteration, only one randomly chosen sample is revealed to update a sparse iterate. Thereby, the memory cost is independent of the sample size and gradient evaluation for one sample is efficient. Perhaps amazingly, we find that with the same parameter, sparsity promoted by batch methods is not preserved in online fashion. We analyze such interesting phenomenon and illustrate some effective variants including mini-batch methods and a hard thresholding based stochastic gradient algorithm. Extensive experiments are carried out on a public dataset which supports our findings and algorithms.  ( 2 min )
    On the inconsistency of separable losses for structured prediction. (arXiv:2301.10810v1 [cs.LG])
    In this paper, we prove that separable negative log-likelihood losses for structured prediction are not necessarily Bayes consistent, or, in other words, minimizing these losses may not result in a model that predicts the most probable structure in the data distribution for a given input. This fact opens the question of whether these losses are well-adapted for structured prediction and, if so, why.  ( 2 min )
    simple diffusion: End-to-end diffusion for high resolution images. (arXiv:2301.11093v1 [cs.CV])
    Currently, applying diffusion models in pixel space of high resolution images is difficult. Instead, existing approaches focus on diffusion in lower dimensional spaces (latent diffusion), or have multiple super-resolution levels of generation referred to as cascades. The downside is that these approaches add additional complexity to the diffusion framework. This paper aims to improve denoising diffusion for high resolution images while keeping the model as simple as possible. The paper is centered around the research question: How can one train a standard denoising diffusion models on high resolution images, and still obtain performance comparable to these alternate approaches? The four main findings are: 1) the noise schedule should be adjusted for high resolution images, 2) It is sufficient to scale only a particular part of the architecture, 3) dropout should be added at specific locations in the architecture, and 4) downsampling is an effective strategy to avoid high resolution feature maps. Combining these simple yet effective techniques, we achieve state-of-the-art on image generation among diffusion models without sampling modifiers on ImageNet.  ( 2 min )
    Graph Encoder Ensemble for Simultaneous Vertex Embedding and Community Detection. (arXiv:2301.11290v1 [cs.SI])
    In this paper we propose a novel and computationally efficient method to simultaneously achieve vertex embedding, community detection, and community size determination. By utilizing a normalized one-hot graph encoder and a new rank-based cluster size measure, the proposed graph encoder ensemble algorithm achieves excellent numerical performance throughout a variety of simulations and real data experiments.  ( 2 min )
    Random Grid Neural Processes for Parametric Partial Differential Equations. (arXiv:2301.11040v1 [cs.LG])
    We introduce a new class of spatially stochastic physics and data informed deep latent models for parametric partial differential equations (PDEs) which operate through scalable variational neural processes. We achieve this by assigning probability measures to the spatial domain, which allows us to treat collocation grids probabilistically as random variables to be marginalised out. Adapting this spatial statistics view, we solve forward and inverse problems for parametric PDEs in a way that leads to the construction of Gaussian process models of solution fields. The implementation of these random grids poses a unique set of challenges for inverse physics informed deep learning frameworks and we propose a new architecture called Grid Invariant Convolutional Networks (GICNets) to overcome these challenges. We further show how to incorporate noisy data in a principled manner into our physics informed model to improve predictions for problems where data may be available but whose measurement location does not coincide with any fixed mesh or grid. The proposed method is tested on a nonlinear Poisson problem, Burgers equation, and Navier-Stokes equations, and we provide extensive numerical comparisons. We demonstrate significant computational advantages over current physics informed neural learning methods for parametric PDEs while improving the predictive capabilities and flexibility of these models.  ( 2 min )
    On the Global Convergence of Risk-Averse Policy Gradient Methods with Dynamic Time-Consistent Risk Measures. (arXiv:2301.10932v1 [cs.LG])
    Risk-sensitive reinforcement learning (RL) has become a popular tool to control the risk of uncertain outcomes and ensure reliable performance in various sequential decision-making problems. While policy gradient methods have been developed for risk-sensitive RL, it remains unclear if these methods enjoy the same global convergence guarantees as in the risk-neutral case. In this paper, we consider a class of dynamic time-consistent risk measures, called Expected Conditional Risk Measures (ECRMs), and derive policy gradient updates for ECRM-based objective functions. Under both constrained direct parameterization and unconstrained softmax parameterization, we provide global convergence of the corresponding risk-averse policy gradient algorithms. We further test a risk-averse variant of REINFORCE algorithm on a stochastic Cliffwalk environment to demonstrate the efficacy of our algorithm and the importance of risk control.  ( 2 min )
    Graph Neural Tangent Kernel: Convergence on Large Graphs. (arXiv:2301.10808v1 [cs.LG])
    Graph neural networks (GNNs) achieve remarkable performance in graph machine learning tasks but can be hard to train on large-graph data, where their learning dynamics are not well understood. We investigate the training dynamics of large-graph GNNs using graph neural tangent kernels (GNTKs) and graphons. In the limit of large width, optimization of an overparametrized NN is equivalent to kernel regression on the NTK. Here, we investigate how the GNTK evolves as another independent dimension is varied: the graph size. We use graphons to define limit objects -- graphon NNs for GNNs, and graphon NTKs for GNTKs, and prove that, on a sequence of growing graphs, the GNTKs converge to the graphon NTK. We further prove that the eigenspaces of the GNTK, which are related to the problem learning directions and associated learning speeds, converge to the spectrum of the GNTK. This implies that in the large-graph limit, the GNTK fitted on a graph of moderate size can be used to solve the same task on the large-graph and infer the learning dynamics of the large-graph GNN. These results are verified empirically on node regression and node classification tasks.  ( 2 min )

  • Open

    ChatGPT can definitely print Russian propaganda including why Prime Minister Justin Trudeau should be charged with treason despite its Wikipedia page
    submitted by /u/Robinsonc1988 [link] [comments]  ( 40 min )
    Text-To-4D Dynamic Scene Generation
    submitted by /u/bperki8 [link] [comments]  ( 40 min )
    Bright Eye: mobile AI app that generates art, code, poems, essays, short stories, answers questions, and more!
    Bright Eye: mobile AI app that generates art, code, poems, essays, short stories, answers questions, and more! Hey guys, I’m the cofounder of a tech startup focused on providing free AI services. We’re one of the first mobile multipurpose AI apps. We’ve developed a pretty cool app that offers AI services like image generation, code generation, image captioning, and more for free. We’re sort of like a Swiss Army knife of generative and analytical AI. We’ve released a new feature called AAIA(Ask AI Anything), which is capable of answering all types of questions, even requests to generate literature, storylines, answer questions and more, (think of chatgpt). We’d love to have some people try it out, give us feedback, and keep in touch with us. https://apps.apple.com/us/app/bright-eye/id1593932475 submitted by /u/SonnyDoge22 [link] [comments]  ( 41 min )
    We need AI to take over all creative mediums and hobbies
    Before you quick draw shoot the messenger here, I want you to hear me out. This is coming from a person who enjoys drawing digital art and a person who used to make music for fun. Humans tend to be very selfish creatures, not in an evil sense, but rather a "phew, glad that was you and not me" sense, like if someone stepped in a pile of of dog poop that sucks for them, but it's not your problem. Once everybodies creative (not labor intenseive) livelyhoods get taken, then yeah, I could see how now AI replicating humanity could be a problem. Then maybe after that, people can enjoy not being replaced. That is all I have to say, god bless my reddit karma, I can feel this subreddit getting ready to dislike bomb this, but as long as this message gets out, that's all I want. submitted by /u/Zima_Re-L [link] [comments]  ( 41 min )
    AI Dream 150 - MY INCREDIBLE DREAM VISUALIZED BY AI - Part3 TEASER - AI ...
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Humanity May Reach Singularity Within Just 7 Years, Trend Shows
    submitted by /u/Tao_Dragon [link] [comments]  ( 40 min )
    VoiceGPT - ChatGPT Voice Assistant
    submitted by /u/nickbild [link] [comments]  ( 40 min )
    The Big Tech Royal Rumble for AI
    submitted by /u/foundersblock [link] [comments]  ( 40 min )
    🚀Rodin: 3D Avatars Using Diffusion
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    Google MusicLM turns language into music
    submitted by /u/Number_5_alive [link] [comments]  ( 40 min )
    Looking to Convert Photos to Video - Anything Out There?
    I want to be able to upload several photos (10 - 100) and have the AI generate a video using those photos. I've seen Genmo, but that only uses one photo at a time. Is there anything that may be able to do this? Thanks submitted by /u/venicerocco [link] [comments]  ( 40 min )
    Freaky A.I concept..
    submitted by /u/KTMark [link] [comments]  ( 40 min )
    fully AI made video i found on yt
    https://www.youtube.com/watch?v=Vw-t826JcDQ submitted by /u/Optimal_Studio_2050 [link] [comments]  ( 40 min )
    Is there any AI that is able to decipher the lyrics of a song?
    Specifically this one: https://www.youtube.com/watch?v=MFv7apjatwM&ab_channel=Lux-Topic If there is no current AI that is able to listen to a song and write down the lyrics accurately, then I provide this idea freely. submitted by /u/A_Very_Horny_Zed [link] [comments]  ( 40 min )
    📌[Searchcolab] Voice Cloning + Image Processing + Lip Syncing. Link in comments
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    Pix2Pix AI Model Inside Stable Diffusion Installation Guide!
    submitted by /u/PuppetHere [link] [comments]  ( 40 min )
    What will happen to the internet once ChatGPT disintermediates websites?
    If ChatGPT and AI search engines become the central place for acquiring knowledge, essentially scraping and disintermediating websites; what commercial incentive will many website owners have to generate new content once their ad revenue is gone? What will happen to the internet? submitted by /u/DelPrive235 [link] [comments]  ( 41 min )
    🧬 The Age of A.I., Longevity and Biotech - Is a "Synthetic biology singularity" coming?
    submitted by /u/BackgroundResult [link] [comments]  ( 41 min )
    Outsmarting AI Detection Tools: How to Make Your AI-Generated Content Fly Under the Radar
    submitted by /u/PapaDudu [link] [comments]  ( 43 min )
    Don’t Chat With ChatGPT: Amazon’s Warning To Employees
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    what is the best way to upscale comics? BSRGAN doesn't work well with text.
    submitted by /u/mhczbnoykrqvzazfth [link] [comments]  ( 41 min )
    📌[Searchcolab] Text-To-4D Dynamic Scene Generation.
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 41 min )
    Why is AI better at visual art than music generation?
    I've just been wondering why people have been able to develop AI models such as GANs and more recently diffusion models that are so good at creating images but music generation has remained somewhat stagnant. From what I understand, there are two ways AI are generally trained on music: through MIDI files and using raw audio. The MIDI files are probably easier for the AI to work with but there's massive loss of relevant data that an AI would need to really become proficient in music generation. On the other hand, AIs trained on the raw audio tend to produce somewhat fuzzy audio that often includes what sounds like nearly distinguishable lyrics. I'm personally convinced the latter is the method that will be universally used in the future, but it clearly has a long way to go. So does anyone have any insights regarding why it's straggling a bit? Perhaps it's because software like Stable Diffusion are piggybacking off the image recognition advancements that have been in very high demand the past couple decades? Perhaps music is, in fact, a more complicated or more strict artform and our human minds are biased to think that these two tasks should be roughly equivalent in difficulty? It's certainly not due to a lack of data to train on, so maybe we just don't have a model that's really suited to analyzing and imitating music. submitted by /u/No-Phrase1116 [link] [comments]  ( 44 min )
  • Open

    MusicLM: Generating Music From Text
    submitted by /u/nickb [link] [comments]  ( 40 min )
    🚀Rodin: 3D Avatars Using Diffusion
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
  • Open

    Pygame window does not closes and kernel freezes
    I was rendering 'MountainCar-v0' environment in "human" render_mode. When I call env.close(), my pygame window doesn't closes automatically and I need to Force Quit it, and unfortunately my kernel dies. I'm using Jupyter Notebook. Following are my versions: PC: Mac OS Ventura 13.1 - Python: Python 3.9.9 - Jupyter:jupyter 1.0.0 - gym: 0.26.2 - ipykernel: 6.6.0 - ipython: 7.30.1 ​ Please let me know how I can solve this issue, Thanks! submitted by /u/Character-Yellow-796 [link] [comments]  ( 41 min )
    Tuning hyperparamters in RL.
    Hello everyone. I have a question about hyperparameter tuning for PPO. usually, when we tune them ( using Optuna for example), each set of hyperparameters is tested for a small number of steps ( 150000 for example ) then we pick the ones that yielded the best reward. Yet, in the long run, the asymptotic convergence might not be reached by the best hyperparameters found during tuning but rather, other hyperparameters that performed worse during tuning are the ones that might results in better asymptotic convergence. Is there any way to overcome this issue and maybe find the best hyperparameters without having to test for a large number of steps. Note: I'm using a custom environment with a continuous action space and image observations. submitted by /u/Many_Reception_4921 [link] [comments]  ( 42 min )
    Model-based hierarchical reinforcement learning
    Hi, do you know papers that combine model-based and hierarchical reinforcement learning, where also the lower level is a model-based approach? I cannot find sufficient paper about it submitted by /u/aika98oe [link] [comments]  ( 42 min )
  • Open

    [D] Monitoring and Retraining Models with Label-Changing Interventions
    When a trained ML model is implemented to predict an adverse event, the user might take steps to avoid that event. The outcome of the user's actions could either be successful or failure. In training with strictly observational data, a typical confusion matrix contains: ŷ=0, y=0 -> True Negative ŷ=0, y=1 -> False Negative ŷ=1, y=1 -> True Positive ŷ=1, y=0 -> False Positive When using the model, some results get confounded if the user acts based on the predictions ŷ=0, y=0, a=0 -> True Negative, No Intervention ŷ=0, y=1, a=0 -> False Negative, No Intervention ŷ=1, y=1, a=1 -> True Positive, Failed Intervention ŷ=1, y=0, a=1 -> False Positive OR Successful Intervention Ignoring the possibility that the intervention caused the adverse event, the involvement of the user may lead to an increase in the number of false positives that are perceived. Continuous monitoring becomes difficult due to perceived faster degradation of the model. Furthermore, retraining the model in the future may be hindered by labels that do not accurately reflect the true values. One approach that I've been proposed is to make sure there is always a hold-out set. Allow some random records to get scores, but do never act on them. This gives both a monitoring and retraining dataset. Are there other solutions that people use here? I've found the papers below, but I cannot say that I completely understand how to practically implement them. Monitoring machine learning (ML)-based risk prediction algorithms in the presence of confounding medical interventions (https://arxiv.org/abs/2211.09781) Model updating after interventions paradoxically introduces bias (https://arxiv.org/abs/2010.11530) submitted by /u/waiting4omscs [link] [comments]  ( 43 min )
    [P] Using algorithms or models from papers for commercial use
    Hey! I am reading the GET3D paper by Nvidia. The paper is listed with the Nvidia license which states: 3.3 Use Limitation. The Work and any derivative works thereof only may be used or intended for use non-commercially. The Work or derivative works thereof may be used or intended for use by Nvidia or its affiliates commercially or non-commercially. As used herein, "non-commercially" means for research or evaluation purposes only and not for any direct or indirect monetary gain. Does it mean there is no commercial way of using the ideas in the paper? Is it possible to use the ideas from that paper or any other paper by Nvidia in some product? As the idea from the paper is only the tool or a part of the product but is not the product itself. submitted by /u/romantimm25 [link] [comments]  ( 44 min )
    [D] Google Predoctoral Program (India) 2023
    Has anyone got any interview email? Did they start the interview process? submitted by /u/Around-star [link] [comments]  ( 42 min )
    [D] ImageNet2012 Advice
    I'm currently at the point in my PhD career that I've developed some extremely successful components of CNNs, architecture, activation, etc. Outperforming default choices on CIFAR10, CIFAR100, Flowers, Caltech101, and other smaller datasets. With how success the results currently are, we want to publish to a top tier conference, specifically NeurIPS this Spring, deadline around May 13th. However, we (me and my advisor) agree that to publish at NeurIPS, our developments need to be backed up by ImageNet. The problem is that we have never trained on ImageNet before (so no experience), and have a limited computational budget. Although our university personally owns 2 A100 40 GB GPUs that we can use, they are shared within the entire university, so a 2 day job takes about 1 week in queue (don't know if we can get the results in time by May). On the other hand, we also don't know if we can get a $2500 grant in time to use cloud resources. For those who have trained on ImageNet, what are some common pitfalls, best ways to transfer data, downloading the dataset, etc? If you performed it on the cloud, how did you do so? How long was your time to train? Expenses? Did you run each model once or three times? Early stopping using validation or test set? NOTE: We will only be using Tensorflow... submitted by /u/MyActualUserName99 [link] [comments]  ( 44 min )
    [D] Best large language model for Named Entity Extraction?
    I'd like to extract named entities, something like this: "[Text]: Microsoft (the word being a portmanteau of "microcomputer software") was founded by Bill Gates on April 4, 1975, to develop and sell BASIC interpreters for the Altair 8800. Steve Ballmer replaced Gates as CEO in 2000, and later envisioned a "devices and services" strategy. [Name]: Steve Ballmer [Position]: CEO [Company]: Microsoft " Tried it on GPT-Neox with 20b parameters with mixed success, is there anything better out there to try for a few-shot learning (without fine tuning)? submitted by /u/TankAttack [link] [comments]  ( 44 min )
    [R] ETLP: Event-based Three-factor Local Plasticity for online learning with neuromorphic hardware
    Neuromorphic perception with event-based sensors, asynchronous hardware and spiking neurons is showing promising results for real-time and energy-efficient inference in embedded systems. The next promise of brain-inspired computing is to enable adaptation to changes at the edge with online learning. However, the parallel and distributed architectures of neuromorphic hardware based on co-localized compute and memory imposes locality constraints to the on-chip learning rules. We propose in this work the Event-based Three-factor Local Plasticity (ETLP) rule that uses (1) the pre-synaptic spike trace, (2) the post-synaptic membrane voltage and (3) a third factor in the form of projected labels with no error calculation, that also serve as update triggers. We apply ETLP with feedforward and recurrent spiking neural networks on visual and auditory event-based pattern recognition, and compare it to Back-Propagation Through Time (BPTT) and eProp. We show a competitive performance in accuracy with a clear advantage in the computational complexity for ETLP. We also show that when using local plasticity, threshold adaptation in spiking neurons and a recurrent topology are necessary to learn spatio-temporal patterns with a rich temporal structure. Finally, we provide a proof of concept hardware implementation of ETLP on FPGA to highlight the simplicity of its computational primitives and how they can be mapped into neuromorphic hardware for online learning with low-energy consumption and real-time interaction. Full paper: https://arxiv.org/abs/2301.08281 submitted by /u/ferquinve [link] [comments]  ( 43 min )
    [D] MusicLM: Generating Music From Text
    How far do you think this can go? Is it a memorization machine or can it create new songs? https://google-research.github.io/seanet/musiclm/examples/ submitted by /u/carlthome [link] [comments]  ( 44 min )
    [R] SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
    Large Language Models (LLMs) from the Generative Pretrained Transformer (GPT) family have shown remarkable performance on a wide range of tasks, but are difficult to deploy because of their massive size and computational costs. For instance, the top-performing GPT-175B model has 175 billion parameters, which total at least 320GB (counting multiples of 1024) of storage in half-precision (FP16) format, leading it to require at least five A100 GPUs with 80GB of memory each for inference. It is therefore natural that there has been significant interest in reducing these costs via model compression. To date, virtually all existing GPT compression approaches have focused on quantization, that is, reducing the precision of the numerical representation of individual weights. A complementary approa…  ( 47 min )
    [D] Meta AI Residency 2023
    Now that apps are closed did anyone hear back yet? please follow this thread and update your status below, tbh I don't really think I have much of a chance but I'm excited none the less. ​ to follow a thread please press the bell on the top right submitted by /u/BeautyInUgly [link] [comments]  ( 42 min )
    [D] Moving away from Unicode for more equal token representation across global languages?
    Edit: as has been explained in the comments, unicode is not the issue so much as the byte-pair encoding scheme, which artificially limits the vocabulary size of the model and leads to less common language using more tokens. I'd like to discuss the impacts of increasing the vocabulary size on transformer model computational requirements. Many languages, like Chinese, Japanese Kanji, Korean, Telugu, etc use complex logograms to represent words and concepts. Unfortunately, these languages are severely "punished" in GPT3 because they are expensive to tokenize due to the way unicode represents them. Instead of unicode representing them as a single code point, logograms are typically represented as a sum of multiple graphemes, meaning that multiple unicode code points underlie their descriptio…  ( 49 min )
    [D] Why are there no End2End Speech Recognition models using the same Encoder-Decoder learning process as BART (no CTC) ?
    I'm new to CTC. After learning about CTC and its application in End2End training for Speech Recognition, I figured that if we want to generate a target sequence (transcript), given a source sequence features, we could use the vanilla Encoder-Decoder architecture in Transformer (also used in T5, BART, etc) alone, without the need of CTC, yet why people are only using CTC for End2End Speech Recoginition, or using hybrid of CTC and Decoder in some papers ? Thanks. submitted by /u/KarmaCut132 [link] [comments]  ( 43 min )
  • Open

    A Unified and Constructive Framework for the Universality of Neural Networks. (arXiv:2112.14877v3 [cs.LG] UPDATED)
    One of the reasons why many neural networks are capable of replicating complicated tasks or functions is their universal property. Though the past few decades have seen tremendous advances in theories of neural networks, a single constructive framework for neural network universality remains unavailable. This paper is the first effort to provide a unified and constructive framework for the universality of a large class of activation functions including most of existing ones. At the heart of the framework is the concept of neural network approximate identity (nAI). The main result is: {\em any nAI activation function is universal}. It turns out that most of existing activation functions are nAI, and thus universal in the space of continuous functions on compacta. The framework induces {\bf several advantages} over the contemporary counterparts. First, it is constructive with elementary means from functional analysis, probability theory, and numerical analysis. Second, it is the first unified attempt that is valid for most of existing activation functions. Third, as a by product, the framework provides the first universality proof for some of the existing activation functions including Mish, SiLU, ELU, GELU, and etc. Fourth, it provides new proofs for most activation functions. Fifth, it discovers new activation functions with guaranteed universality property. Sixth, for a given activation and error tolerance, the framework provides precisely the architecture of the corresponding one-hidden neural network with predetermined number of neurons, and the values of weights/biases. Seventh, the framework allows us to abstractly present the first universal approximation with favorable non-asymptotic rate.  ( 3 min )
    The Devil is the Classifier: Investigating Long Tail Relation Classification with Decoupling Analysis. (arXiv:2009.07022v1 [cs.LG] CROSS LISTED)
    Long-tailed relation classification is a challenging problem as the head classes may dominate the training phase, thereby leading to the deterioration of the tail performance. Existing solutions usually address this issue via class-balancing strategies, e.g., data re-sampling and loss re-weighting, but all these methods adhere to the schema of entangling learning of the representation and classifier. In this study, we conduct an in-depth empirical investigation into the long-tailed problem and found that pre-trained models with instance-balanced sampling already capture the well-learned representations for all classes; moreover, it is possible to achieve better long-tailed classification ability at low cost by only adjusting the classifier. Inspired by this observation, we propose a robust classifier with attentive relation routing, which assigns soft weights by automatically aggregating the relations. Extensive experiments on two datasets demonstrate the effectiveness of our proposed approach. Code and datasets are available in https://github.com/zjunlp/deepke.  ( 2 min )
    Contrastive Triple Extraction with Generative Transformer. (arXiv:2009.06207v8 [cs.CL] CROSS LISTED)
    Triple extraction is an essential task in information extraction for natural language processing and knowledge graph construction. In this paper, we revisit the end-to-end triple extraction task for sequence generation. Since generative triple extraction may struggle to capture long-term dependencies and generate unfaithful triples, we introduce a novel model, contrastive triple extraction with a generative transformer. Specifically, we introduce a single shared transformer module for encoder-decoder-based generation. To generate faithful results, we propose a novel triplet contrastive training object. Moreover, we introduce two mechanisms to further improve model performance (i.e., batch-wise dynamic attention-masking and triple-wise calibration). Experimental results on three datasets (i.e., NYT, WebNLG, and MIE) show that our approach achieves better performance than that of baselines.  ( 2 min )
    Interaction Modeling with Multiplex Attention. (arXiv:2208.10660v2 [cs.LG] UPDATED)
    Modeling multi-agent systems requires understanding how agents interact. Such systems are often difficult to model because they can involve a variety of types of interactions that layer together to drive rich social behavioral dynamics. Here we introduce a method for accurately modeling multi-agent systems. We present Interaction Modeling with Multiplex Attention (IMMA), a forward prediction model that uses a multiplex latent graph to represent multiple independent types of interactions and attention to account for relations of different strengths. We also introduce Progressive Layer Training, a training strategy for this architecture. We show that our approach outperforms state-of-the-art models in trajectory forecasting and relation inference, spanning three multi-agent scenarios: social navigation, cooperative task achievement, and team sports. We further demonstrate that our approach can improve zero-shot generalization and allows us to probe how different interactions impact agent behavior.  ( 2 min )
    Deep learning in a bilateral brain with hemispheric specialization. (arXiv:2209.06862v6 [q-bio.NC] UPDATED)
    The brains of all bilaterally symmetric animals on Earth are divided into left and right hemispheres. The anatomy and functionality of the hemispheres have a large degree of overlap, but there are asymmetries and they specialize to possess different attributes. Several studies have used computational models to mimic hemispheric asymmetries with a focus on reproducing human data on semantic and visual processing tasks. In this study, we aimed to understand how dual hemispheres could interact in a given task. We propose a bilateral artificial neural network that imitates a lateralization observed in nature: that the left hemisphere specializes in specificity and the right in generalities. We used two ResNet-9 convolutional neural networks with different training objectives and tested it on an image classification task. Our analysis found that the hemispheres represent complementary features that are exploited by a network head which implements a type of weighted attention. The bilateral architecture outperformed a range of baselines of similar representational capacity that don't exploit differential specialization. The results demonstrate the efficacy of bilateralism, contribute to an understanding of bilateralism in biological brains and the architecture serves as an inductive bias when designing new AI systems.  ( 2 min )
    Long-tail Relation Extraction via Knowledge Graph Embeddings and Graph Convolution Networks. (arXiv:1903.01306v1 [cs.IR] CROSS LISTED)
    We propose a distance supervised relation extraction approach for long-tailed, imbalanced data which is prevalent in real-world settings. Here, the challenge is to learn accurate "few-shot" models for classes existing at the tail of the class distribution, for which little data is available. Inspired by the rich semantic correlations between classes at the long tail and those at the head, we take advantage of the knowledge from data-rich classes at the head of the distribution to boost the performance of the data-poor classes at the tail. First, we propose to leverage implicit relational knowledge among class labels from knowledge graph embeddings and learn explicit relational knowledge using graph convolution networks. Second, we integrate that relational knowledge into relation extraction model by coarse-to-fine knowledge-aware attention mechanism. We demonstrate our results for a large-scale benchmark dataset which show that our approach significantly outperforms other baselines, especially for long-tail relations.  ( 2 min )
    Towards Dynamic Stability Assessment of Power Grid Topologies using Graph Neural Networks. (arXiv:2206.06369v3 [cs.LG] UPDATED)
    To mitigate climate change, the share of renewable energies in power production needs to be increased. Renewables introduce new challenges to power grids regarding the dynamic stability due to decentralization, reduced inertia and volatility in production. Since dynamic stability simulations are intractable and exceedingly expensive for large grids, graph neural networks (GNNs) are a promising method to reduce the computational effort of analyzing dynamic stability of power grids. We provide new datasets of dynamic stability of synthetic power grids and find that GNNs are surprisingly effective at predicting the highly non-linear targets from topological information only. Furthermore, we use GNNs to demonstrate the accurate identification of particularly vulnerable nodes in power grids, so-called troublemakers. Lastly, we find that GNNs trained on small grids generate accurate predictions on a large synthetic model of the Texan power grid, which illustrates the potential for real-world applications of the presented approach.  ( 2 min )
    Why the pseudo label based semi-supervised learning algorithm is effective?. (arXiv:2211.10039v2 [cs.LG] UPDATED)
    Recently, pseudo label based semi-supervised learning has achieved great success in many fields. The core idea of the pseudo label based semi-supervised learning algorithm is to use the model trained on the labeled data to generate pseudo labels on the unlabeled data, and then train a model to fit the previously generated pseudo labels. We give a theory analysis for why pseudo label based semi-supervised learning is effective in this paper. We mainly compare the generalization error of the model trained under two settings: (1) There are N labeled data. (2) There are N unlabeled data and a suitable initial model. Our analysis shows that, firstly, when the amount of unlabeled data tends to infinity, the pseudo label based semi-supervised learning algorithm can obtain model which have the same generalization error upper bound as model obtained by normally training in the condition of the amount of labeled data tends to infinity. More importantly, we prove that when the amount of unlabeled data is large enough, the generalization error upper bound of the model obtained by pseudo label based semi-supervised learning algorithm can converge to the optimal upper bound with linear convergence rate. We also give the lower bound on sampling complexity to achieve linear convergence rate. Our analysis contributes to understanding the empirical successes of pseudo label-based semi-supervised learning.  ( 2 min )
    ZJUKLAB at SemEval-2021 Task 4: Negative Augmentation with Language Model for Reading Comprehension of Abstract Meaning. (arXiv:2102.12828v3 [cs.CL] CROSS LISTED)
    This paper presents our systems for the three Subtasks of SemEval Task4: Reading Comprehension of Abstract Meaning (ReCAM). We explain the algorithms used to learn our models and the process of tuning the algorithms and selecting the best model. Inspired by the similarity of the ReCAM task and the language pre-training, we propose a simple yet effective technology, namely, negative augmentation with language model. Evaluation results demonstrate the effectiveness of our proposed approach. Our models achieve the 4th rank on both official test sets of Subtask 1 and Subtask 2 with an accuracy of 87.9% and an accuracy of 92.8%, respectively. We further conduct comprehensive model analysis and observe interesting error cases, which may promote future researches.  ( 2 min )
    Ultra-NeRF: Neural Radiance Fields for Ultrasound Imaging. (arXiv:2301.10520v1 [eess.IV])
    We present a physics-enhanced implicit neural representation (INR) for ultrasound (US) imaging that learns tissue properties from overlapping US sweeps. Our proposed method leverages a ray-tracing-based neural rendering for novel view US synthesis. Recent publications demonstrated that INR models could encode a representation of a three-dimensional scene from a set of two-dimensional US frames. However, these models fail to consider the view-dependent changes in appearance and geometry intrinsic to US imaging. In our work, we discuss direction-dependent changes in the scene and show that a physics-inspired rendering improves the fidelity of US image synthesis. In particular, we demonstrate experimentally that our proposed method generates geometrically accurate B-mode images for regions with ambiguous representation owing to view-dependent differences of the US images. We conduct our experiments using simulated B-mode US sweeps of the liver and acquired US sweeps of a spine phantom tracked with a robotic arm. The experiments corroborate that our method generates US frames that enable consistent volume compounding from previously unseen views. To the best of our knowledge, the presented work is the first to address view-dependent US image synthesis using INR.
    Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners. (arXiv:2108.13161v7 [cs.CL] CROSS LISTED)
    Large-scale pre-trained language models have contributed significantly to natural language processing by demonstrating remarkable abilities as few-shot learners. However, their effectiveness depends mainly on scaling the model parameters and prompt design, hindering their implementation in most real-world applications. This study proposes a novel pluggable, extensible, and efficient approach named DifferentiAble pRompT (DART), which can convert small language models into better few-shot learners without any prompt engineering. The main principle behind this approach involves reformulating potential natural language processing tasks into the task of a pre-trained language model and differentially optimizing the prompt template as well as the target label with backpropagation. Furthermore, the proposed approach can be: (i) Plugged to any pre-trained language models; (ii) Extended to widespread classification tasks. A comprehensive evaluation of standard NLP tasks demonstrates that the proposed approach achieves a better few-shot performance. Code is available in https://github.com/zjunlp/DART.
    PromptKG: A Prompt Learning Framework for Knowledge Graph Representation Learning and Application. (arXiv:2210.00305v1 [cs.CL] CROSS LISTED)
    Knowledge Graphs (KGs) often have two characteristics: heterogeneous graph structure and text-rich entity/relation information. KG representation models should consider graph structures and text semantics, but no comprehensive open-sourced framework is mainly designed for KG regarding informative text description. In this paper, we present PromptKG, a prompt learning framework for KG representation learning and application that equips the cutting-edge text-based methods, integrates a new prompt learning model and supports various tasks (e.g., knowledge graph completion, question answering, recommendation, and knowledge probing). PromptKG is publicly open-sourced at https://github.com/zjunlp/PromptKG with long-term technical support.
    Conceptualized Representation Learning for Chinese Biomedical Text Mining. (arXiv:2008.10813v1 [cs.CL] CROSS LISTED)
    Biomedical text mining is becoming increasingly important as the number of biomedical documents and web data rapidly grows. Recently, word representation models such as BERT has gained popularity among researchers. However, it is difficult to estimate their performance on datasets containing biomedical texts as the word distributions of general and biomedical corpora are quite different. Moreover, the medical domain has long-tail concepts and terminologies that are difficult to be learned via language models. For the Chinese biomedical text, it is more difficult due to its complex structure and the variety of phrase combinations. In this paper, we investigate how the recently introduced pre-trained language model BERT can be adapted for Chinese biomedical corpora and propose a novel conceptualized representation learning approach. We also release a new Chinese Biomedical Language Understanding Evaluation benchmark (\textbf{ChineseBLUE}). We examine the effectiveness of Chinese pre-trained models: BERT, BERT-wwm, RoBERTa, and our approach. Experimental results on the benchmark show that our approach could bring significant gain. We release the pre-trained model on GitHub: https://github.com/alibaba-research/ChineseBLUE.
    Normal vs. Adversarial: Salience-based Analysis of Adversarial Samples for Relation Extraction. (arXiv:2104.00312v4 [cs.CL] CROSS LISTED)
    Recent neural-based relation extraction approaches, though achieving promising improvement on benchmark datasets, have reported their vulnerability towards adversarial attacks. Thus far, efforts mostly focused on generating adversarial samples or defending adversarial attacks, but little is known about the difference between normal and adversarial samples. In this work, we take the first step to leverage the salience-based method to analyze those adversarial samples. We observe that salience tokens have a direct correlation with adversarial perturbations. We further find the adversarial perturbations are either those tokens not existing in the training set or superficial cues associated with relation labels. To some extent, our approach unveils the characters against adversarial samples. We release an open-source testbed, "DiagnoseAdv" in https://github.com/zjunlp/DiagnoseAdv.
    Reasoning Through Memorization: Nearest Neighbor Knowledge Graph Embeddings. (arXiv:2201.05575v2 [cs.CL] CROSS LISTED)
    Previous knowledge graph embedding approaches usually map entities to representations and utilize score functions to predict the target entities, yet they struggle to reason rare or emerging unseen entities. In this paper, we propose kNN-KGE, a new knowledge graph embedding approach with pre-trained language models, by linearly interpolating its entity distribution with k-nearest neighbors. We compute the nearest neighbors based on the distance in the entity embedding space from the knowledge store. Our approach can allow rare or emerging entities to be memorized explicitly rather than implicitly in model parameters. Experimental results demonstrate that our approach can improve inductive and transductive link prediction results and yield better performance for low-resource settings with only a few triples, which might be easier to reason via explicit memory. Code is available at https://github.com/zjunlp/KNN-KG.
    Data-Driven Certification of Neural Networks with Random Input Noise. (arXiv:2010.01171v2 [cs.LG] UPDATED)
    Methods to certify the robustness of neural networks in the presence of input uncertainty are vital in safety-critical settings. Most certification methods in the literature are designed for adversarial or worst-case inputs, but researchers have recently shown a need for methods that consider random input noise. In this paper, we examine the setting where inputs are subject to random noise coming from an arbitrary probability distribution. We propose a robustness certification method that lower-bounds the probability that network outputs are safe. This bound is cast as a chance-constrained optimization problem, which is then reformulated using input-output samples to make the optimization constraints tractable. We develop sufficient conditions for the resulting optimization to be convex, as well as on the number of samples needed to make the robustness bound hold with overwhelming probability. We show for a special case that the proposed optimization reduces to an intuitive closed-form solution. Case studies on synthetic, MNIST, and CIFAR-10 networks experimentally demonstrate that this method is able to certify robustness against various input noise regimes over larger uncertainty regions than prior state-of-the-art techniques.
    Logic-Based Explainability in Machine Learning. (arXiv:2211.00541v2 [cs.AI] UPDATED)
    The last decade witnessed an ever-increasing stream of successes in Machine Learning (ML). These successes offer clear evidence that ML is bound to become pervasive in a wide range of practical uses, including many that directly affect humans. Unfortunately, the operation of the most successful ML models is incomprehensible for human decision makers. As a result, the use of ML models, especially in high-risk and safety-critical settings is not without concern. In recent years, there have been efforts on devising approaches for explaining ML models. Most of these efforts have focused on so-called model-agnostic approaches. However, all model-agnostic and related approaches offer no guarantees of rigor, hence being referred to as non-formal. For example, such non-formal explanations can be consistent with different predictions, which renders them useless in practice. This paper overviews the ongoing research efforts on computing rigorous model-based explanations of ML models; these being referred to as formal explanations. These efforts encompass a variety of topics, that include the actual definitions of explanations, the characterization of the complexity of computing explanations, the currently best logical encodings for reasoning about different ML models, and also how to make explanations interpretable for human decision makers, among others.
    Disentangled Contrastive Learning for Learning Robust Textual Representations. (arXiv:2104.04907v2 [cs.CL] CROSS LISTED)
    Although the self-supervised pre-training of transformer models has resulted in the revolutionizing of natural language processing (NLP) applications and the achievement of state-of-the-art results with regard to various benchmarks, this process is still vulnerable to small and imperceptible permutations originating from legitimate inputs. Intuitively, the representations should be similar in the feature space with subtle input permutations, while large variations occur with different meanings. This motivates us to investigate the learning of robust textual representation in a contrastive manner. However, it is non-trivial to obtain opposing semantic instances for textual samples. In this study, we propose a disentangled contrastive learning method that separately optimizes the uniformity and alignment of representations without negative sampling. Specifically, we introduce the concept of momentum representation consistency to align features and leverage power normalization while conforming the uniformity. Our experimental results for the NLP benchmarks demonstrate that our approach can obtain better results compared with the baselines, as well as achieve promising improvements with invariance tests and adversarial attacks. The code is available in https://github.com/zxlzr/DCL.
    Multi-Agent Deep Reinforcement Learning for Efficient Passenger Delivery in Urban Air Mobility. (arXiv:2211.06890v2 [cs.MA] UPDATED)
    It has been considered that urban air mobility (UAM), also known as drone-taxi or electrical vertical takeoff and landing (eVTOL), will play a key role in future transportation. By putting UAM into practical future transportation, several benefits can be realized, i.e., (i) the total travel time of passengers can be reduced compared to traditional transportation and (ii) there is no environmental pollution and no special labor costs to operate the system because electric batteries will be used in UAM system. However, there are various dynamic and uncertain factors in the flight environment, i.e., passenger sudden service requests, battery discharge, and collision among UAMs. Therefore, this paper proposes a novel cooperative MADRL algorithm based on centralized training and distributed execution (CTDE) concepts for reliable and efficient passenger delivery in UAM networks. According to the performance evaluation results, we confirm that the proposed algorithm outperforms other existing algorithms in terms of the number of serviced passengers increase (30%) and the waiting time per serviced passenger decrease (26%).
    Self-Supervised Hierarchical Metrical Structure Modeling. (arXiv:2210.17183v2 [cs.SD] UPDATED)
    We propose a novel method to model hierarchical metrical structures for both symbolic music and audio signals in a self-supervised manner with minimal domain knowledge. The model trains and inferences on beat-aligned music signals and predicts an 8-layer hierarchical metrical tree from beat, measure to the section level. The training procedure does not require any hierarchical metrical labeling except for beats, purely relying on the nature of metrical regularity and inter-voice consistency as inductive biases. We show in experiments that the method achieves comparable performance with supervised baselines on multiple metrical structure analysis tasks on both symbolic music and audio signals. All demos, source code and pre-trained models are publicly available on GitHub.
    On Robustness and Bias Analysis of BERT-based Relation Extraction. (arXiv:2009.06206v5 [cs.CL] CROSS LISTED)
    Fine-tuning pre-trained models have achieved impressive performance on standard natural language processing benchmarks. However, the resultant model generalizability remains poorly understood. We do not know, for example, how excellent performance can lead to the perfection of generalization models. In this study, we analyze a fine-tuned BERT model from different perspectives using relation extraction. We also characterize the differences in generalization techniques according to our proposed improvements. From empirical experimentation, we find that BERT suffers a bottleneck in terms of robustness by way of randomizations, adversarial and counterfactual tests, and biases (i.e., selection and semantic). These findings highlight opportunities for future improvements. Our open-sourced testbed DiagnoseRE is available in \url{https://github.com/zjunlp/DiagnoseRE}.
    Kformer: Knowledge Injection in Transformer Feed-Forward Layers. (arXiv:2201.05742v2 [cs.CL] CROSS LISTED)
    Recent days have witnessed a diverse set of knowledge injection models for pre-trained language models (PTMs); however, most previous studies neglect the PTMs' own ability with quantities of implicit knowledge stored in parameters. A recent study has observed knowledge neurons in the Feed Forward Network (FFN), which are responsible for expressing factual knowledge. In this work, we propose a simple model, Kformer, which takes advantage of the knowledge stored in PTMs and external knowledge via knowledge injection in Transformer FFN layers. Empirically results on two knowledge-intensive tasks, commonsense reasoning (i.e., SocialIQA) and medical question answering (i.e., MedQA-USMLE), demonstrate that Kformer can yield better performance than other knowledge injection technologies such as concatenation or attention-based injection. We think the proposed simple model and empirical findings may be helpful for the community to develop more powerful knowledge injection methods. Code available in https://github.com/zjunlp/Kformer.
    Non-Asymptotic Analysis of a UCB-based Top Two Algorithm. (arXiv:2210.05431v2 [stat.ML] UPDATED)
    A Top Two sampling rule for bandit identification is a method which selects the next arm to sample from among two candidate arms, a leader and a challenger. Due to their simplicity and good empirical performance, they have received increased attention in recent years. However, for fixed-confidence best arm identification, theoretical guarantees for Top Two methods have only been obtained in the asymptotic regime, when the error level vanishes. In this paper, we derive the first non-asymptotic upper bound on the expected sample complexity of a Top Two algorithm, which holds for any error level. Our analysis highlights sufficient properties for a regret minimization algorithm to be used as leader. These properties are satisfied by the UCB algorithm, and our proposed UCB-based Top Two algorithm simultaneously enjoys non-asymptotic guarantees and competitive empirical performance.
    On the Probability of Necessity and Sufficiency of Explaining Graph Neural Networks: A Lower Bound Optimization Approach. (arXiv:2212.07056v2 [cs.LG] UPDATED)
    The explainability of Graph Neural Networks (GNNs) is critical to various GNN applications but remains an open challenge. A convincing explanation should be both necessary and sufficient simultaneously. However, existing GNN explaining approaches focus on only one of the two aspects, necessity or sufficiency, or a heuristic trade-off between the two. Theoretically, the Probability of Necessity and Sufficiency (PNS) can be applied to search for the most necessary and sufficient explanation since it can mathematically quantify the necessity and sufficiency of an explanation. Nevertheless, the difficulty of obtaining PNS due to non-monotonicity and the challenge of counterfactual estimation limit its wide use. To address the non-identifiability of PNS, we resort to a lower bound of PNS that can be optimized via counterfactual estimation, and propose Necessary and Sufficient Explanation for GNN (NSEG) via optimizing that lower bound. Specifically, we employ nearest neighbor matching to generate counterfactual samples and leverage continuous masks with a sampling strategy to optimize the lower bound. Empirical study shows that NSEG achieves excellent performance in generating the most necessary and sufficient explanations among a series of state-of-the-art methods.
    Learning to Ask for Data-Efficient Event Argument Extraction. (arXiv:2110.00479v1 [cs.CL] CROSS LISTED)
    Event argument extraction (EAE) is an important task for information extraction to discover specific argument roles. In this study, we cast EAE as a question-based cloze task and empirically analyze fixed discrete token template performance. As generating human-annotated question templates is often time-consuming and labor-intensive, we further propose a novel approach called "Learning to Ask," which can learn optimized question templates for EAE without human annotations. Experiments using the ACE-2005 dataset demonstrate that our method based on optimized questions achieves state-of-the-art performance in both the few-shot and supervised settings.
    Communication-Efficient Diffusion Strategy for Performance Improvement of Federated Learning with Non-IID Data. (arXiv:2207.07493v3 [cs.DC] UPDATED)
    Federated learning (FL) is a novel learning paradigm that addresses the privacy leakage challenge of centralized learning. However, in FL, users with non-independent and identically distributed (non-IID) characteristics can deteriorate the performance of the global model. Specifically, the global model suffers from the weight divergence challenge owing to non-IID data. To address the aforementioned challenge, we propose a novel diffusion strategy of the machine learning (ML) model (FedDif) to maximize the FL performance with non-IID data. In FedDif, users spread local models to neighboring users over D2D communications. FedDif enables the local model to experience different distributions before parameter aggregation. Furthermore, we theoretically demonstrate that FedDif can circumvent the weight divergence challenge. On the theoretical basis, we propose the communication-efficient diffusion strategy of the ML model, which can determine the trade-off between the learning performance and communication cost based on auction theory. The performance evaluation results show that FedDif improves the test accuracy of the global model by 10.37% compared to the baseline FL with non-IID settings. Moreover, FedDif improves the number of consumed sub-frames by 1.28 to 2.85 folds to the latest methods except for the model compression scheme. FedDif also improves the number of transmitted models by 1.43 to 2.67 folds to the latest methods.
    Relation Adversarial Network for Low Resource Knowledge Graph Completion. (arXiv:1911.03091v6 [cs.CL] CROSS LISTED)
    Knowledge Graph Completion (KGC) has been proposed to improve Knowledge Graphs by filling in missing connections via link prediction or relation extraction. One of the main difficulties for KGC is a low resource problem. Previous approaches assume sufficient training triples to learn versatile vectors for entities and relations, or a satisfactory number of labeled sentences to train a competent relation extraction model. However, low resource relations are very common in KGs, and those newly added relations often do not have many known samples for training. In this work, we aim at predicting new facts under a challenging setting where only limited training instances are available. We propose a general framework called Weighted Relation Adversarial Network, which utilizes an adversarial procedure to help adapt knowledge/features learned from high resource relations to different but related low resource relations. Specifically, the framework takes advantage of a relation discriminator to distinguish between samples from different relations, and help learn relation-invariant features more transferable from source relations to target relations. Experimental results show that the proposed approach outperforms previous methods regarding low resource settings for both link prediction and relation extraction.
    Schema-aware Reference as Prompt Improves Data-Efficient Relational Triple and Event Extraction. (arXiv:2210.10709v3 [cs.CL] CROSS LISTED)
    Information Extraction, which aims to extract structural relational triple or event from unstructured texts, often suffers from data scarcity issues. With the development of pre-trained language models, many prompt-based approaches to data-efficient information extraction have been proposed and achieved impressive performance. However, existing prompt learning methods for information extraction are still susceptible to several potential limitations: (i) semantic gap between natural language and output structure knowledge with pre-defined schema; (ii) representation learning with locally individual instances limits the performance given the insufficient features. In this paper, we propose a novel approach of schema-aware Reference As Prompt (RAP), which dynamically leverage schema and knowledge inherited from global (few-shot) training data for each sample. Specifically, we propose a schema-aware reference store, which unifies symbolic schema and relevant textual instances. Then, we employ a dynamic reference integration module to retrieve pertinent knowledge from the datastore as prompts during training and inference. Experimental results demonstrate that RAP can be plugged into various existing models and outperforms baselines in low-resource settings on four datasets of relational triple extraction and event extraction. In addition, we provide comprehensive empirical ablations and case analysis regarding different types and scales of knowledge in order to better understand the mechanisms of RAP. Code is available in https://github.com/zjunlp/RAP.
    LOGEN: Few-shot Logical Knowledge-Conditioned Text Generation with Self-training. (arXiv:2112.01404v2 [cs.CL] CROSS LISTED)
    Natural language generation from structured data mainly focuses on surface-level descriptions, suffering from uncontrollable content selection and low fidelity. Previous works leverage logical forms to facilitate logical knowledge-conditioned text generation. Though achieving remarkable progress, they are data-hungry, which makes the adoption for real-world applications challenging with limited data. To this end, this paper proposes a unified framework for logical knowledge-conditioned text generation in the few-shot setting. With only a few seeds logical forms (e.g., 20/100 shot), our approach leverages self-training and samples pseudo logical forms based on content and structure consistency. Experimental results demonstrate that our approach can obtain better few-shot performance than baselines.
    Document-level Relation Extraction as Semantic Segmentation. (arXiv:2106.03618v2 [cs.CL] CROSS LISTED)
    Document-level relation extraction aims to extract relations among multiple entity pairs from a document. Previously proposed graph-based or transformer-based models utilize the entities independently, regardless of global information among relational triples. This paper approaches the problem by predicting an entity-level relation matrix to capture local and global information, parallel to the semantic segmentation task in computer vision. Herein, we propose a Document U-shaped Network for document-level relation extraction. Specifically, we leverage an encoder module to capture the context information of entities and a U-shaped segmentation module over the image-style feature map to capture global interdependency among triples. Experimental results show that our approach can obtain state-of-the-art performance on three benchmark datasets DocRED, CDR, and GDA.
    Signature Methods in Machine Learning. (arXiv:2206.14674v3 [stat.ML] UPDATED)
    Signature-based techniques give mathematical insight into the interactions between complex streams of evolving data. These insights can be quite naturally translated into numerical approaches to understanding streamed data, and perhaps because of their mathematical precision, have proved useful in analysing streamed data in situations where the data is irregular, and not stationary, and the dimension of the data and the sample sizes are both moderate. Understanding streamed multi-modal data is exponential: a word in $n$ letters from an alphabet of size $d$ can be any one of $d^n$ messages. Signatures remove the exponential amount of noise that arises from sampling irregularity, but an exponential amount of information still remain. This survey aims to stay in the domain where that exponential scaling can be managed directly. Scalability issues are an important challenge in many problems but would require another survey article and further ideas. This survey describes a range of contexts where the data sets are small enough to remove the possibility of massive machine learning, and the existence of small sets of context free and principled features can be used effectively. The mathematical nature of the tools can make their use intimidating to non-mathematicians. The examples presented in this article are intended to bridge this communication gap and provide tractable working examples drawn from the machine learning context. Notebooks are available online for several of these examples. This survey builds on the earlier paper of Ilya Chevryev and Andrey Kormilitzin which had broadly similar aims at an earlier point in the development of this machinery. This article illustrates how the theoretical insights offered by signatures are simply realised in the analysis of application data in a way that is largely agnostic to the data type.
    Variance-Reduced Conservative Policy Iteration. (arXiv:2212.06283v2 [cs.LG] UPDATED)
    We study the sample complexity of reducing reinforcement learning to a sequence of empirical risk minimization problems over the policy space. Such reductions-based algorithms exhibit local convergence in the function space, as opposed to the parameter space for policy gradient algorithms, and thus are unaffected by the possibly non-linear or discontinuous parameterization of the policy class. We propose a variance-reduced variant of Conservative Policy Iteration that improves the sample complexity of producing a $\varepsilon$-functional local optimum from $O(\varepsilon^{-4})$ to $O(\varepsilon^{-3})$. Under state-coverage and policy-completeness assumptions, the algorithm enjoys $\varepsilon$-global optimality after sampling $O(\varepsilon^{-2})$ times, improving upon the previously established $O(\varepsilon^{-3})$ sample requirement.
    Linear TreeShap. (arXiv:2209.08192v2 [cs.LG] UPDATED)
    Decision trees are well-known due to their ease of interpretability. To improve accuracy, we need to grow deep trees or ensembles of trees. These are hard to interpret, offsetting their original benefits. Shapley values have recently become a popular way to explain the predictions of tree-based machine learning models. It provides a linear weighting to features independent of the tree structure. The rise in popularity is mainly due to TreeShap, which solves a general exponential complexity problem in polynomial time. Following extensive adoption in the industry, more efficient algorithms are required. This paper presents a more efficient and straightforward algorithm: Linear TreeShap. Like TreeShap, Linear TreeShap is exact and requires the same amount of memory.
    On the Semi-supervised Expectation Maximization. (arXiv:2211.00537v2 [cs.LG] UPDATED)
    The Expectation Maximization (EM) algorithm is widely used as an iterative modification to maximum likelihood estimation when the data is incomplete. We focus on a semi-supervised case to learn the model from labeled and unlabeled samples. Existing work in the semi-supervised case has focused mainly on performance rather than convergence guarantee, however we focus on the contribution of the labeled samples to the convergence rate. The analysis clearly demonstrates how the labeled samples improve the convergence rate for the exponential family mixture model. In this case, we assume that the population EM (EM with unlimited data) is initialized within the neighborhood of global convergence for the population EM that consists solely of samples that have not been labeled. The analysis for the labeled samples provides a comprehensive description of the convergence rate for the Gaussian mixture model. In addition, we extend the findings for labeled samples and offer an alternative proof for the population EM's convergence rate with unlabeled samples for the symmetric mixture of two Gaussians.
    A Sequential Deep Learning Algorithm for Sampled Mixed-integer Optimisation Problems. (arXiv:2301.10703v1 [math.OC])
    Mixed-integer optimisation problems can be computationally challenging. Here, we introduce and analyse two efficient algorithms with a specific sequential design that are aimed at dealing with sampled problems within this class. At each iteration step of both algorithms, we first test the feasibility of a given test solution for each and every constraint associated with the sampled optimisation at hand, while also identifying those constraints that are violated. Subsequently, an optimisation problem is constructed with a constraint set consisting of the current basis -- namely the smallest set of constraints that fully specifies the current test solution -- as well as constraints related to a limited number of the identified violating samples. We show that both algorithms exhibit finite-time convergence towards the optimal solution. Algorithm 2 features a neural network classifier that notably improves the computational performance compared to Algorithm 1. We establish quantitatively the efficacy of these algorithms by means of three numerical tests: robust optimal power flow, robust unit commitment, and robust random mixed-integer linear program.
    Obstacle Identification and Ellipsoidal Decomposition for Fast Motion Planning in Unknown Dynamic Environments. (arXiv:2209.14233v2 [cs.RO] UPDATED)
    Collision avoidance in the presence of dynamic obstacles in unknown environments is one of the most critical challenges for unmanned systems. In this paper, we present a method that identifies obstacles in terms of ellipsoids to estimate linear and angular obstacle velocities. Our proposed method is based on the idea of any object can be approximately expressed by ellipsoids. To achieve this, we propose a method based on variational Bayesian estimation of Gaussian mixture model, the Kyachiyan algorithm, and a refinement algorithm. Our proposed method does not require knowledge of the number of clusters and can operate in real-time, unlike existing optimization-based methods. In addition, we define an ellipsoid-based feature vector to match obstacles given two timely close point frames. Our method can be applied to any environment with static and dynamic obstacles, including the ones with rotating obstacles. We compare our algorithm with other clustering methods and show that when coupled with a trajectory planner, the overall system can efficiently traverse unknown environments in the presence of dynamic obstacles.
    A General Stochastic Optimization Framework for Convergence Bidding. (arXiv:2210.06543v3 [math.OC] UPDATED)
    Convergence (virtual) bidding is an important part of two-settlement electric power markets as it can effectively reduce discrepancies between the day-ahead and real-time markets. Consequently, there is extensive research into the bidding strategies of virtual participants aiming to obtain optimal bids to submit to the day-ahead market. In this paper, we introduce a price-based general stochastic optimization framework to obtain optimal convergence bid curves. Within this framework, we develop a computationally tractable linear programming-based optimization model, which produces bid prices and volumes simultaneously. We also show that different approximations and simplifications in the general model lead naturally to state-of-the-art convergence bidding approaches, such as self-scheduling and opportunistic approaches. Our general framework also provides a straightforward way to compare the performance of these models, which is demonstrated by numerical experiments on the California (CAISO) market.
    Adversarial De-confounding in Individualised Treatment Effects Estimation. (arXiv:2210.10530v3 [cs.LG] UPDATED)
    Observational studies have recently received significant attention from the machine learning community due to the increasingly available non-experimental observational data and the limitations of the experimental studies, such as considerable cost, impracticality, small and less representative sample sizes, etc. In observational studies, de-confounding is a fundamental problem of individualised treatment effects (ITE) estimation. This paper proposes disentangled representations with adversarial training to selectively balance the confounders in the binary treatment setting for the ITE estimation. The adversarial training of treatment policy selectively encourages treatment-agnostic balanced representations for the confounders and helps to estimate the ITE in the observational studies via counterfactual inference. Empirical results on synthetic and real-world datasets, with varying degrees of confounding, prove that our proposed approach improves the state-of-the-art methods in achieving lower error in the ITE estimation.
    Context-aware Deep Model for Entity Recommendation in Search Engine at Alibaba. (arXiv:1909.04493v1 [cs.IR] CROSS LISTED)
    Entity recommendation, providing search users with an improved experience via assisting them in finding related entities for a given query, has become an indispensable feature of today's search engines. Existing studies typically only consider the queries with explicit entities. They usually fail to handle complex queries that without entities, such as "what food is good for cold weather", because their models could not infer the underlying meaning of the input text. In this work, we believe that contexts convey valuable evidence that could facilitate the semantic modeling of queries, and take them into consideration for entity recommendation. In order to better model the semantics of queries and entities, we learn the representation of queries and entities jointly with attentive deep neural networks. We evaluate our approach using large-scale, real-world search logs from a widely used commercial Chinese search engine. Our system has been deployed in ShenMa Search Engine and you can fetch it in UC Browser of Alibaba. Results from online A/B test suggest that the impression efficiency of click-through rate increased by 5.1% and page view increased by 5.5%.
    Convergence of Random Reshuffling Under The Kurdyka-{\L}ojasiewicz Inequality. (arXiv:2110.04926v4 [math.OC] UPDATED)
    We study the random reshuffling (RR) method for smooth nonconvex optimization problems with a finite-sum structure. Though this method is widely utilized in practice such as the training of neural networks, its convergence behavior is only understood in several limited settings. In this paper, under the well-known Kurdyka-Lojasiewicz (KL) inequality, we establish strong limit-point convergence results for RR with appropriate diminishing step sizes, namely, the whole sequence of iterates generated by RR is convergent and converges to a single stationary point in an almost sure sense. In addition, we derive the corresponding rate of convergence, depending on the KL exponent and the suitably selected diminishing step sizes. When the KL exponent lies in $[0,\frac12]$, the convergence is at a rate of $\mathcal{O}(t^{-1})$ with $t$ counting the iteration number. When the KL exponent belongs to $(\frac12,1)$, our derived convergence rate is of the form $\mathcal{O}(t^{-q})$ with $q\in (0,1)$ depending on the KL exponent. The standard KL inequality-based convergence analysis framework only applies to algorithms with a certain descent property. We conduct a novel convergence analysis for the non-descent RR method with diminishing step sizes based on the KL inequality, which generalizes the standard KL framework. We summarize our main steps and core ideas in an informal analysis framework, which is of independent interest. As a direct application of this framework, we also establish similar strong limit-point convergence results for the reshuffled proximal point method.
    A systematic review of biologically-informed deep learning models for cancer: fundamental trends for encoding and interpreting oncology data. (arXiv:2207.00812v3 [q-bio.QM] UPDATED)
    There is an increasing interest in the use of Deep Learning (DL) based methods as a supporting analytical framework in oncology. However, most direct applications of DL will deliver models with limited transparency and explainability, which constrain their deployment in biomedical settings. This systematic review discusses DL models used to support inference in cancer biology with a particular emphasis on multi-omics analysis. It focuses on how existing models address the need for better dialogue with prior knowledge, biological plausibility and interpretability, fundamental properties in the biomedical domain. For this, we retrieved and analyzed 42 studies focusing on emerging architectural and methodological advances, the encoding of biological domain knowledge and the integration of explainability methods. We discuss the recent evolutionary arch of DL models in the direction of integrating prior biological relational and network knowledge to support better generalisation (e.g. pathways or Protein-Protein-Interaction networks) and interpretability. This represents a fundamental functional shift towards models which can integrate mechanistic and statistical inference aspects. We introduce a concept of bio-centric interpretability and according to its taxonomy, we discuss representational methodologies for the integration of domain prior knowledge in such models. The paper provides a critical outlook into contemporary methods for explainability and interpretabiltiy used in DL for cancer. The analysis points in the direction of a convergence between encoding prior knowledge and improved interpretability. We introduce bio-centric interpretability which is an important step towards formalisation of biological interpretability of DL models and developing methods that are less problem- or application-specific.
    Asymptotic Analysis of Deep Residual Networks. (arXiv:2212.08199v2 [cs.LG] UPDATED)
    We investigate the asymptotic properties of deep Residual networks (ResNets) as the number of layers increases. We first show the existence of scaling regimes for trained weights markedly different from those implicitly assumed in the neural ODE literature. We study the convergence of the hidden state dynamics in these scaling regimes, showing that one may obtain an ODE, a stochastic differential equation (SDE) or neither of these. In particular, our findings point to the existence of a diffusive regime in which the deep network limit is described by a class of stochastic differential equations (SDEs). Finally, we derive the corresponding scaling limits for the backpropagation dynamics.
    To be or not to be stable, that is the question: understanding neural networks for inverse problems. (arXiv:2211.13692v2 [math.NA] UPDATED)
    The solution of linear inverse problems arising, for example, in signal and image processing is a challenging problem since the ill-conditioning amplifies the noise on the data. Recently introduced algorithms based on deep learning overwhelm the more traditional model-based approaches, but they typically suffer from instability with respect to data perturbation. In this paper, we theoretically analyze the trade-off between neural networks stability and accuracy in the solution of linear inverse problems. Moreover, we propose different supervised and unsupervised solutions to increase network stability that maintains good accuracy by inheriting, in the network training, regularization from a model-based iterative scheme. Extensive numerical experiments on image deblurring confirm the theoretical results and the effectiveness of the proposed deep learning-based solutions to stably solve noisy inverse problems.
    Connecting metrics for shape-texture knowledge in computer vision. (arXiv:2301.10608v1 [cs.CV])
    Modern artificial neural networks, including convolutional neural networks and vision transformers, have mastered several computer vision tasks, including object recognition. However, there are many significant differences between the behavior and robustness of these systems and of the human visual system. Deep neural networks remain brittle and susceptible to many changes in the image that do not cause humans to misclassify images. Part of this different behavior may be explained by the type of features humans and deep neural networks use in vision tasks. Humans tend to classify objects according to their shape while deep neural networks seem to rely mostly on texture. Exploring this question is relevant, since it may lead to better performing neural network architectures and to a better understanding of the workings of the vision system of primates. In this work, we advance the state of the art in our understanding of this phenomenon, by extending previous analyses to a much larger set of deep neural network architectures. We found that the performance of models in image classification tasks is highly correlated with their shape bias measured at the output and penultimate layer. Furthermore, our results showed that the number of neurons that represent shape and texture are strongly anti-correlated, thus providing evidence that there is competition between these two types of features. Finally, we observed that while in general there is a correlation between performance and shape bias, there are significant variations between architecture families.
    FewShotTextGCN: K-hop neighborhood regularization for few-shot learning on graphs. (arXiv:2301.10481v1 [cs.CL])
    We present FewShotTextGCN, a novel method designed to effectively utilize the properties of word-document graphs for improved learning in low-resource settings. We introduce K-hop Neighbourhood Regularization, a regularizer for heterogeneous graphs, and show that it stabilizes and improves learning when only a few training samples are available. We furthermore propose a simplification in the graph-construction method, which results in a graph that is $\sim$7 times less dense and yields better performance in little-resource settings while performing on par with the state of the art in high-resource settings. Finally, we introduce a new variant of Adaptive Pseudo-Labeling tailored for word-document graphs. When using as little as 20 samples for training, we outperform a strong TextGCN baseline with 17% in absolute accuracy on average over eight languages. We demonstrate that our method can be applied to document classification without any language model pretraining on a wide range of typologically diverse languages while performing on par with large pretrained language models.
    User-Interactive Offline Reinforcement Learning. (arXiv:2205.10629v2 [cs.LG] UPDATED)
    Offline reinforcement learning algorithms still lack trust in practice due to the risk that the learned policy performs worse than the original policy that generated the dataset or behaves in an unexpected way that is unfamiliar to the user. At the same time, offline RL algorithms are not able to tune their most important hyperparameter - the proximity of the learned policy to the original policy. We propose an algorithm that allows the user to tune this hyperparameter at runtime, thereby addressing both of the above mentioned issues simultaneously. This allows users to start with the original behavior and grant successively greater deviation, as well as stopping at any time when the policy deteriorates or the behavior is too far from the familiar one.
    Near-Optimal No-Regret Learning for Correlated Equilibria in Multi-Player General-Sum Games. (arXiv:2111.06008v3 [cs.LG] UPDATED)
    Recently, Daskalakis, Fishelson, and Golowich (DFG) (NeurIPS`21) showed that if all agents in a multi-player general-sum normal-form game employ Optimistic Multiplicative Weights Update (OMWU), the external regret of every player is $O(\textrm{polylog}(T))$ after $T$ repetitions of the game. We extend their result from external regret to internal regret and swap regret, thereby establishing uncoupled learning dynamics that converge to an approximate correlated equilibrium at the rate of $\tilde{O}(T^{-1})$. This substantially improves over the prior best rate of convergence for correlated equilibria of $O(T^{-3/4})$ due to Chen and Peng (NeurIPS`20), and it is optimal -- within the no-regret framework -- up to polylogarithmic factors in $T$. To obtain these results, we develop new techniques for establishing higher-order smoothness for learning dynamics involving fixed point operations. Specifically, we establish that the no-internal-regret learning dynamics of Stoltz and Lugosi (Mach Learn`05) are equivalently simulated by no-external-regret dynamics on a combinatorial space. This allows us to trade the computation of the stationary distribution on a polynomial-sized Markov chain for a (much more well-behaved) linear transformation on an exponential-sized set, enabling us to leverage similar techniques as DFG to near-optimally bound the internal regret. Moreover, we establish an $O(\textrm{polylog}(T))$ no-swap-regret bound for the classic algorithm of Blum and Mansour (BM) (JMLR`07). We do so by introducing a technique based on the Cauchy Integral Formula that circumvents the more limited combinatorial arguments of DFG. In addition to shedding clarity on the near-optimal regret guarantees of BM, our arguments provide insights into the various ways in which the techniques by DFG can be extended and leveraged in the analysis of more involved learning algorithms.
    Distributed Control of Partial Differential Equations Using Convolutional Reinforcement Learning. (arXiv:2301.10737v1 [cs.LG])
    We present a convolutional framework which significantly reduces the complexity and thus, the computational effort for distributed reinforcement learning control of dynamical systems governed by partial differential equations (PDEs). Exploiting translational invariances, the high-dimensional distributed control problem can be transformed into a multi-agent control problem with many identical, uncoupled agents. Furthermore, using the fact that information is transported with finite velocity in many cases, the dimension of the agents' environment can be drastically reduced using a convolution operation over the state space of the PDE. In this setting, the complexity can be flexibly adjusted via the kernel width or by using a stride greater than one. Moreover, scaling from smaller to larger systems -- or the transfer between different domains -- becomes a straightforward task requiring little effort. We demonstrate the performance of the proposed framework using several PDE examples with increasing complexity, where stabilization is achieved by training a low-dimensional deep deterministic policy gradient agent using minimal computing resources.
    Tighter Bounds on the Expressivity of Transformer Encoders. (arXiv:2301.10743v1 [cs.LG])
    Characterizing neural networks in terms of better-understood formal systems has the potential to yield new insights into the power and limitations of these networks. Doing so for transformers remains an active area of research. Bhattamishra and others have shown that transformer encoders are at least as expressive as a certain kind of counter machine, while Merrill and Sabharwal have shown that fixed-precision transformer encoders recognize only languages in uniform $TC^0$. We connect and strengthen these results by identifying a variant of first-order logic with counting quantifiers that is simultaneously an upper bound for fixed-precision transformer encoders and a lower bound for transformer encoders. This brings us much closer than before to an exact characterization of the languages that transformer encoders recognize.
    Spatio-Temporal Graph Neural Networks: A Survey. (arXiv:2301.10569v1 [cs.LG])
    Graph Neural Networks have gained huge interest in the past few years. These powerful algorithms expanded deep learning models to non-Euclidean space and were able to achieve state of art performance in various applications including recommender systems and social networks. However, this performance is based on static graph structures assumption which limits the Graph Neural Networks performance when the data varies with time. Temporal Graph Neural Networks are extension of Graph Neural Networks that takes the time factor into account. Recently, various Temporal Graph Neural Network algorithms were proposed and achieved superior performance compared to other deep learning algorithms in several time dependent applications. This survey discusses interesting topics related to Spatio temporal Graph Neural Networks, including algorithms, application, and open challenges.
    Evaluation of the syllables pronunciation quality in speech rehabilitation through the solution of the classification problem. (arXiv:2301.10585v1 [cs.LG])
    The solution of the problem of assessing the quality of the pronunciation of syllables during speech rehabilitation after surgical treatment of oncological diseases of the organs of the speech-forming tract is considered in the work. The assessment is carried out by solving the problem of classifying syllables into two classes: before and immediately after surgical treatment. A classifier is built on the basis of the LSTM neural network and trained on the records before the operation and immediately after it, before the start of speech rehabilitation. The measure of assessing the quality of syllables pronunciation in the process of rehabilitation is the metric of belonging to the class before the operation. A study is being made of the influence of taking into account problematic phonemes, the gender of the patient, his individual characteristics on the resulting estimates of the quality of pronunciation. A comparison with existing types of syllable pronunciation quality assessments is carried out, recommendations are given for the practical application of the resulting new class of pronunciation quality assessments.
    Prediction of COVID-19 by Its Variants using Multivariate Data-driven Deep Learning Models. (arXiv:2301.10616v1 [cs.CE])
    The Coronavirus Disease 2019 or the COVID-19 pandemic has swept almost all parts of the world since the first case was found in Wuhan, China, in December 2019. With the increasing number of COVID-19 cases in the world, SARS-CoV-2 has mutated into various variants. Given the increasingly dangerous conditions of the pandemic, it is crucial to know when the pandemic will stop by predicting confirmed cases of COVID-19. Therefore, many studies have raised COVID-19 as a case study to overcome the ongoing pandemic using the Deep Learning method, namely LSTM, with reasonably accurate results and small error values. LSTM training is used to predict confirmed cases of COVID-19 based on variants that have been identified using ECDC's COVID-19 dataset containing confirmed cases of COVID-19 that have been identified from 30 countries in Europe. Tests were conducted using the LSTM and BiLSTM models with the addition of RNN as comparisons on hidden size and layer size. The obtained result showed that in testing hidden sizes 25, 50, 75 to 100, the RNN model provided better results, with the minimum MSE value of 0.01 and the RMSE value of 0.012 for B.1.427/B.1.429 variant with hidden size 100. In further testing of layer sizes 2, 3, 4, and 5, the result shows that the BiLSTM model provided better results, with minimum MSE value of 0.01 and the RMSE of 0.01 for the B.1.427/B.1.429 variant with hidden size 100 and layer size 2.
    Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs. (arXiv:2106.02684v3 [cs.LG] UPDATED)
    We address the issue of safety in reinforcement learning. We pose the problem in an episodic framework of a constrained Markov decision process. Existing results have shown that it is possible to achieve a reward regret of $\tilde{\mathcal{O}}(\sqrt{K})$ while allowing an $\tilde{\mathcal{O}}(\sqrt{K})$ constraint violation in $K$ episodes. A critical question that arises is whether it is possible to keep the constraint violation even smaller. We show that when a strictly safe policy is known, then one can confine the system to zero constraint violation with arbitrarily high probability while keeping the reward regret of order $\tilde{\mathcal{O}}(\sqrt{K})$. The algorithm which does so employs the principle of optimistic pessimism in the face of uncertainty to achieve safe exploration. When no strictly safe policy is known, though one is known to exist, then it is possible to restrict the system to bounded constraint violation with arbitrarily high probability. This is shown to be realized by a primal-dual algorithm with an optimistic primal estimate and a pessimistic dual update.
    Trainable Loss Weights in Super-Resolution. (arXiv:2301.10575v1 [cs.CV])
    In recent years, research on super-resolution has primarily focused on the development of unsupervised models, blind networks, and the use of optimization methods in non-blind models. But, limited research has discussed the loss function in the super-resolution process. The majority of those studies have only used perceptual similarity in a conventional way. This is while the development of appropriate loss can improve the quality of other methods as well. In this article, a new weighting method for pixel-wise loss is proposed. With the help of this method, it is possible to use trainable weights based on the general structure of the image and its perceptual features while maintaining the advantages of pixel-wise loss. Also, a criterion for comparing weights of loss is introduced so that the weights can be estimated directly by a convolutional neural network using this criterion. In addition, in this article, the expectation-maximization method is used for the simultaneous estimation super-resolution network and weighting network. In addition, a new activation function, called "FixedSum", is introduced which can keep the sum of all components of vector constants while keeping the output components between zero and one. As shown in the experimental results section, weighted loss by the proposed method leads to better results than the unweighted loss in both signal-to-noise and perceptual similarity senses.
    Certifiable 3D Object Pose Estimation: Foundations, Learning Models, and Self-Training. (arXiv:2206.11215v3 [cs.CV] UPDATED)
    We consider a certifiable object pose estimation problem, where -- given a partial point cloud of an object -- the goal is to not only estimate the object pose, but also to provide a certificate of correctness for the resulting estimate. Our first contribution is a general theory of certification for end-to-end perception models. In particular, we introduce the notion of $\zeta$-correctness, which bounds the distance between an estimate and the ground truth. We show that $\zeta$-correctness can be assessed by implementing two certificates: (i) a certificate of observable correctness, that asserts if the model output is consistent with the input data and prior information, (ii) a certificate of non-degeneracy, that asserts whether the input data is sufficient to compute a unique estimate. Our second contribution is to apply this theory and design a new learning-based certifiable pose estimator. We propose C-3PO, a semantic-keypoint-based pose estimation model, augmented with the two certificates, to solve the certifiable pose estimation problem. C-3PO also includes a keypoint corrector, implemented as a differentiable optimization layer, that can correct large detection errors (e.g. due to the sim-to-real gap). Our third contribution is a novel self-supervised training approach that uses our certificate of observable correctness to provide the supervisory signal to C-3PO during training. In it, the model trains only on the observably correct input-output pairs, in each training iteration. As training progresses, we see that the observably correct input-output pairs grow, eventually reaching near 100% in many cases. Our experiments show that (i) standard semantic-keypoint-based methods outperform more recent alternatives, (ii) C-3PO further improves performance and significantly outperforms all the baselines, and (iii) C-3PO's certificates are able to discern correct pose estimates.
    Two Efficient Ridge Solutions for the Incremental Broad Learning System on Added Inputs. (arXiv:1911.07292v5 [cs.LG] UPDATED)
    This paper proposes the recursive and square-root BLS algorithms to improve the original BLS for new added inputs, which utilize the inverse and inverse Cholesky factor of the Hermitian matrix in the ridge inverse, respectively, to update the ridge solution. The recursive BLS updates the inverse by the matrix inversion lemma, while the square-root BLS updates the upper-triangular inverse Cholesky factor by multiplying it with an upper-triangular intermediate matrix. When the added p training samples are more than the total k nodes in the network, i.e., p>k, the inverse of a sum of matrices is applied to take a smaller matrix inversion or inverse Cholesky factorization. For the distributed BLS with data-parallelism, we introduce the parallel implementation of the square-root BLS, which is deduced from the parallel implementation of the inverse Cholesky factorization. The original BLS based on the generalized inverse with the ridge regression assumes the ridge parameter lamda->0 in the ridge inverse. When lambda->0 is not satisfied, the numerical experiments on the MNIST and NORB datasets show that both the proposed ridge solutions improve the testing accuracy of the original BLS, and the improvement becomes more significant as lambda is bigger. On the other hand, compared to the original BLS, both the proposed BLS algorithms theoretically require less complexities, and are significantly faster in the simulations on the MNIST dataset. The speedups in total training time of the recursive and square-root BLS algorithms over the original BLS are 4.41 and 6.92 respectively when p > k, and are 2.80 and 1.59 respectively when p < k.
    Certifying Neural Network Robustness to Random Input Noise from Samples. (arXiv:2010.07532v2 [cs.LG] UPDATED)
    Methods to certify the robustness of neural networks in the presence of input uncertainty are vital in safety-critical settings. Most certification methods in the literature are designed for adversarial input uncertainty, but researchers have recently shown a need for methods that consider random uncertainty. In this paper, we propose a novel robustness certification method that upper bounds the probability of misclassification when the input noise follows an arbitrary probability distribution. This bound is cast as a chance-constrained optimization problem, which is then reformulated using input-output samples to replace the optimization constraints. The resulting optimization reduces to a linear program with an analytical solution. Furthermore, we develop a sufficient condition on the number of samples needed to make the misclassification bound hold with overwhelming probability. Our case studies on MNIST classifiers show that this method is able to certify a uniform infinity-norm uncertainty region with a radius of nearly 50 times larger than what the current state-of-the-art method can certify.
    Bridging Text and Knowledge with Multi-Prototype Embedding for Few-Shot Relational Triple Extraction. (arXiv:2010.16059v1 [cs.CL] CROSS LISTED)
    Current supervised relational triple extraction approaches require huge amounts of labeled data and thus suffer from poor performance in few-shot settings. However, people can grasp new knowledge by learning a few instances. To this end, we take the first step to study the few-shot relational triple extraction, which has not been well understood. Unlike previous single-task few-shot problems, relational triple extraction is more challenging as the entities and relations have implicit correlations. In this paper, We propose a novel multi-prototype embedding network model to jointly extract the composition of relational triples, namely, entity pairs and corresponding relations. To be specific, we design a hybrid prototypical learning mechanism that bridges text and knowledge concerning both entities and relations. Thus, implicit correlations between entities and relations are injected. Additionally, we propose a prototype-aware regularization to learn more representative prototypes. Experimental results demonstrate that the proposed method can improve the performance of the few-shot triple extraction.
    LightNER: A Lightweight Tuning Paradigm for Low-resource NER via Pluggable Prompting. (arXiv:2109.00720v5 [cs.CL] CROSS LISTED)
    Most NER methods rely on extensive labeled data for model training, which struggles in the low-resource scenarios with limited training data. Existing dominant approaches usually suffer from the challenge that the target domain has different label sets compared with a resource-rich source domain, which can be concluded as class transfer and domain transfer. In this paper, we propose a lightweight tuning paradigm for low-resource NER via pluggable prompting (LightNER). Specifically, we construct the unified learnable verbalizer of entity categories to generate the entity span sequence and entity categories without any label-specific classifiers, thus addressing the class transfer issue. We further propose a pluggable guidance module by incorporating learnable parameters into the self-attention layer as guidance, which can re-modulate the attention and adapt pre-trained weights. Note that we only tune those inserted module with the whole parameter of the pre-trained language model fixed, thus, making our approach lightweight and flexible for low-resource scenarios and can better transfer knowledge across domains. Experimental results show that LightNER can obtain comparable performance in the standard supervised setting and outperform strong baselines in low-resource settings. Code is in https://github.com/zjunlp/DeepKE/tree/main/example/ner/few-shot.
    VN-Transformer: Rotation-Equivariant Attention for Vector Neurons. (arXiv:2206.04176v3 [cs.CV] UPDATED)
    Rotation equivariance is a desirable property in many practical applications such as motion forecasting and 3D perception, where it can offer benefits like sample efficiency, better generalization, and robustness to input perturbations. Vector Neurons (VN) is a recently developed framework offering a simple yet effective approach for deriving rotation-equivariant analogs of standard machine learning operations by extending one-dimensional scalar neurons to three-dimensional "vector neurons." We introduce a novel "VN-Transformer" architecture to address several shortcomings of the current VN models. Our contributions are: $(i)$ we derive a rotation-equivariant attention mechanism which eliminates the need for the heavy feature preprocessing required by the original Vector Neurons models; $(ii)$ we extend the VN framework to support non-spatial attributes, expanding the applicability of these models to real-world datasets; $(iii)$ we derive a rotation-equivariant mechanism for multi-scale reduction of point-cloud resolution, greatly speeding up inference and training; $(iv)$ we show that small tradeoffs in equivariance ($\epsilon$-approximate equivariance) can be used to obtain large improvements in numerical stability and training robustness on accelerated hardware, and we bound the propagation of equivariance violations in our models. Finally, we apply our VN-Transformer to 3D shape classification and motion forecasting with compelling results.
    PULL: Reactive Log Anomaly Detection Based On Iterative PU Learning. (arXiv:2301.10681v1 [cs.LG])
    Due to the complexity of modern IT services, failures can be manifold, occur at any stage, and are hard to detect. For this reason, anomaly detection applied to monitoring data such as logs allows gaining relevant insights to improve IT services steadily and eradicate failures. However, existing anomaly detection methods that provide high accuracy often rely on labeled training data, which are time-consuming to obtain in practice. Therefore, we propose PULL, an iterative log analysis method for reactive anomaly detection based on estimated failure time windows provided by monitoring systems instead of labeled data. Our attention-based model uses a novel objective function for weak supervision deep learning that accounts for imbalanced data and applies an iterative learning strategy for positive and unknown samples (PU learning) to identify anomalous logs. Our evaluation shows that PULL consistently outperforms ten benchmark baselines across three different datasets and detects anomalous log messages with an F1-score of more than 0.99 even within imprecise failure time windows.
    Probing Taxonomic and Thematic Embeddings for Taxonomic Information. (arXiv:2301.10656v1 [cs.CL])
    Modelling taxonomic and thematic relatedness is important for building AI with comprehensive natural language understanding. The goal of this paper is to learn more about how taxonomic information is structurally encoded in embeddings. To do this, we design a new hypernym-hyponym probing task and perform a comparative probing study of taxonomic and thematic SGNS and GloVe embeddings. Our experiments indicate that both types of embeddings encode some taxonomic information, but the amount, as well as the geometric properties of the encodings, are independently related to both the encoder architecture, as well as the embedding training data. Specifically, we find that only taxonomic embeddings carry taxonomic information in their norm, which is determined by the underlying distribution in the data.
    XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models. (arXiv:2301.10472v1 [cs.CL])
    Large multilingual language models typically rely on a single vocabulary shared across 100+ languages. As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged. This vocabulary bottleneck limits the representational capabilities of multilingual models like XLM-R. In this paper, we introduce a new approach for scaling to very large multilingual vocabularies by de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity to achieve sufficient coverage for each individual language. Tokenizations using our vocabulary are typically more semantically meaningful and shorter compared to XLM-R. Leveraging this improved vocabulary, we train XLM-V, a multilingual language model with a one million token vocabulary. XLM-V outperforms XLM-R on every task we tested on ranging from natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), and named entity recognition (WikiAnn) to low-resource tasks (Americas NLI, MasakhaNER).
    A Boosting Approach to Reinforcement Learning. (arXiv:2108.09767v2 [cs.LG] UPDATED)
    Reducing reinforcement learning to supervised learning is a well-studied and effective approach that leverages the benefits of compact function approximation to deal with large-scale Markov decision processes. Independently, the boosting methodology (e.g. AdaBoost) has proven to be indispensable in designing efficient and accurate classification algorithms by combining inaccurate rules-of-thumb. In this paper, we take a further step: we reduce reinforcement learning to a sequence of weak learning problems. Since weak learners perform only marginally better than random guesses, such subroutines constitute a weaker assumption than the availability of an accurate supervised learning oracle. We prove that the sample complexity and running time bounds of the proposed method do not explicitly depend on the number of states. While existing results on boosting operate on convex losses, the value function over policies is non-convex. We show how to use a non-convex variant of the Frank-Wolfe method for boosting, that additionally improves upon the known sample complexity and running time even for reductions to supervised learning.
    Transfer Learning in Deep Learning Models for Building Load Forecasting: Case of Limited Data. (arXiv:2301.10663v1 [cs.LG])
    Precise load forecasting in buildings could increase the bill savings potential and facilitate optimized strategies for power generation planning. With the rapid evolution of computer science, data-driven techniques, in particular the Deep Learning models, have become a promising solution for the load forecasting problem. These models have showed accurate forecasting results; however, they need abundance amount of historical data to maintain the performance. Considering the new buildings and buildings with low resolution measuring equipment, it is difficult to get enough historical data from them, leading to poor forecasting performance. In order to adapt Deep Learning models for buildings with limited and scarce data, this paper proposes a Building-to-Building Transfer Learning framework to overcome the problem and enhance the performance of Deep Learning models. The transfer learning approach was applied to a new technique known as Transformer model due to its efficacy in capturing data trends. The performance of the algorithm was tested on a large commercial building with limited data. The result showed that the proposed approach improved the forecasting accuracy by 56.8% compared to the case of conventional deep learning where training from scratch is used. The paper also compared the proposed Transformer model to other sequential deep learning models such as Long-short Term Memory (LSTM) and Recurrent Neural Network (RNN). The accuracy of the transformer model outperformed other models by reducing the root mean square error to 0.009, compared to LSTM with 0.011 and RNN with 0.051.
    Dimensionality Expansion of Load Monitoring Time Series and Transfer Learning for EMS. (arXiv:2204.02802v3 [cs.LG] UPDATED)
    Energy management systems (EMS) rely on (non)-intrusive load monitoring (N)ILM to monitor and manage appliances and help residents be more energy efficient and thus more frugal. The robustness as well as the transfer potential of the most promising machine learning solutions for (N)ILM is not yet fully understood as they are trained and evaluated on relatively limited data. In this paper, we propose a new approach for load monitoring in building EMS based on dimensionality expansion of time series and transfer learning. We perform an extensive evaluation on 5 different low-frequency datasets. The proposed feature dimensionality expansion using video-like transformation and resource-aware deep learning architecture achieves an average weighted F1 score of 0.88 across the datasets with 29 appliances and is computationally more efficient compared to the state-of-the-art imaging methods. Investigating the proposed method for cross-dataset intra-domain transfer learning, we find that 1) our method performs with an average weighted F1 score of 0.80 while requiring 3-times fewer epochs for model training compared to the non-transfer approach, 2) can achieve an F1 score of 0.75 with only 230 data samples, and 3) our transfer approach outperforms the state-of-the-art in precision drop by up to 12 percentage points for unseen appliances.
    What are the Machine Learning best practices reported by practitioners on Stack Exchange?. (arXiv:2301.10516v1 [cs.SE])
    Machine Learning (ML) is being used in multiple disciplines due to its powerful capability to infer relationships within data. In particular, Software Engineering (SE) is one of those disciplines in which ML has been used for multiple tasks, like software categorization, bugs prediction, and testing. In addition to the multiple ML applications, some studies have been conducted to detect and understand possible pitfalls and issues when using ML. However, to the best of our knowledge, only a few studies have focused on presenting ML best practices or guidelines for the application of ML in different domains. In addition, the practices and literature presented in previous literature (i) are domain-specific (e.g., concrete practices in biomechanics), (ii) describe few practices, or (iii) the practices lack rigorous validation and are presented in gray literature. In this paper, we present a study listing 127 ML best practices systematically mining 242 posts of 14 different Stack Exchange (STE) websites and validated by four independent ML experts. The list of practices is presented in a set of categories related to different stages of the implementation process of an ML-enabled system; for each practice, we include explanations and examples. In all the practices, the provided examples focus on SE tasks. We expect this list of practices could help practitioners to understand better the practices and use ML in a more informed way, in particular newcomers to this new area that sits at the intersection of software engineering and machine learning.
    Infinitesimal gradient boosting. (arXiv:2104.13208v2 [stat.ML] UPDATED)
    We define infinitesimal gradient boosting as a limit of the popular tree-based gradient boosting algorithm from machine learning. The limit is considered in the vanishing-learning-rate asymptotic, that is when the learning rate tends to zero and the number of gradient trees is rescaled accordingly. For this purpose, we introduce a new class of randomized regression trees bridging totally randomized trees and Extra Trees and using a softmax distribution for binary splitting. Our main result is the convergence of the associated stochastic algorithm and the characterization of the limiting procedure as the unique solution of a nonlinear ordinary differential equation in a infinite dimensional function space. Infinitesimal gradient boosting defines a smooth path in the space of continuous functions along which the training error decreases, the residuals remain centered and the total variation is well controlled.
    Backward Compatibility During Data Updates by Weight Interpolation. (arXiv:2301.10546v1 [cs.LG])
    Backward compatibility of model predictions is a desired property when updating a machine learning driven application. It allows to seamlessly improve the underlying model without introducing regression bugs. In classification tasks these bugs occur in the form of negative flips. This means an instance that was correctly classified by the old model is now classified incorrectly by the updated model. This has direct negative impact on the user experience of such systems e.g. a frequently used voice assistant query is suddenly misclassified. A common reason to update the model is when new training data becomes available and needs to be incorporated. Simply retraining the model with the updated data introduces the unwanted negative flips. We study the problem of regression during data updates and propose Backward Compatible Weight Interpolation (BCWI). This method interpolates between the weights of the old and new model and we show in extensive experiments that it reduces negative flips without sacrificing the improved accuracy of the new model. BCWI is straight forward to implement and does not increase inference cost. We also explore the use of importance weighting during interpolation and averaging the weights of multiple new models in order to further reduce negative flips.
    Meta-Learning PAC-Bayes Priors in Model Averaging. (arXiv:1912.11252v3 [cs.LG] UPDATED)
    Nowadays model uncertainty has become one of the most important problems in both academia and industry. In this paper, we mainly consider the scenario in which we have a common model set used for model averaging instead of selecting a single final model via a model selection procedure to account for this model's uncertainty to improve the reliability and accuracy of inferences. Here one main challenge is to learn the prior over the model set. To tackle this problem, we propose two data-based algorithms to get proper priors for model averaging. One is for meta-learner, the analysts should use historical similar tasks to extract the information about the prior. The other one is for base-learner, a subsampling method is used to deal with the data step by step. Theoretically, an upper bound of risk for our algorithm is presented to guarantee the performance of the worst situation. In practice, both methods perform well in simulations and real data studies, especially with poor-quality data.
    Truthful Self-Play. (arXiv:2106.03007v4 [stat.ML] UPDATED)
    We present a general optimization framework for emergent belief-state representation without any supervision. We employed the common configuration of multiagent reinforcement learning and communication to improve exploration coverage over an environment by leveraging the knowledge of each agent. In this paper, we obtained that recurrent neural nets (RNNs) with shared weights are highly biased in partially observable environments because of their noncooperativity. To address this, we designated an unbiased version of self-play via mechanism design, also known as reverse game theory, to clarify unbiased knowledge at the Bayesian Nash equilibrium. The key idea is to add imaginary rewards using the peer prediction mechanism, i.e., a mechanism for mutually criticizing information in a decentralized environment. Numerical analyses, including StarCraft exploration tasks with up to 20 agents and off-the-shelf RNNs, demonstrate the state-of-the-art performance.
    Automated multilingual detection of Pro-Kremlin propaganda in newspapers and Telegram posts. (arXiv:2301.10604v1 [cs.CL])
    The full-scale conflict between the Russian Federation and Ukraine generated an unprecedented amount of news articles and social media data reflecting opposing ideologies and narratives. These polarized campaigns have led to mutual accusations of misinformation and fake news, shaping an atmosphere of confusion and mistrust for readers worldwide. This study analyses how the media affected and mirrored public opinion during the first month of the war using news articles and Telegram news channels in Ukrainian, Russian, Romanian and English. We propose and compare two methods of multilingual automated pro-Kremlin propaganda identification, based on Transformers and linguistic features. We analyse the advantages and disadvantages of both methods, their adaptability to new genres and languages, and ethical considerations of their usage for content moderation. With this work, we aim to lay the foundation for further development of moderation tools tailored to the current conflict.  ( 2 min )
    E(n)-equivariant Graph Neural Cellular Automata. (arXiv:2301.10497v1 [cs.LG])
    Cellular automata (CAs) are computational models exhibiting rich dynamics emerging from the local interaction of cells arranged in a regular lattice. Graph CAs (GCAs) generalise standard CAs by allowing for arbitrary graphs rather than regular lattices, similar to how Graph Neural Networks (GNNs) generalise Convolutional NNs. Recently, Graph Neural CAs (GNCAs) have been proposed as models built on top of standard GNNs that can be trained to approximate the transition rule of any arbitrary GCA. Existing GNCAs are anisotropic in the sense that their transition rules are not equivariant to translation, rotation, and reflection of the nodes' spatial locations. However, it is desirable for instances related by such transformations to be treated identically by the model. By replacing standard graph convolutions with E(n)-equivariant ones, we avoid anisotropy by design and propose a class of isotropic automata that we call E(n)-GNCAs. These models are lightweight, but can nevertheless handle large graphs, capture complex dynamics and exhibit emergent self-organising behaviours. We showcase the broad and successful applicability of E(n)-GNCAs on three different tasks: (i) pattern formation, (ii) graph auto-encoding, and (iii) simulation of E(n)-equivariant dynamical systems.  ( 2 min )
    When to Trust Aggregated Gradients: Addressing Negative Client Sampling in Federated Learning. (arXiv:2301.10400v1 [cs.LG])
    Federated Learning has become a widely-used framework which allows learning a global model on decentralized local datasets under the condition of protecting local data privacy. However, federated learning faces severe optimization difficulty when training samples are not independently and identically distributed (non-i.i.d.). In this paper, we point out that the client sampling practice plays a decisive role in the aforementioned optimization difficulty. We find that the negative client sampling will cause the merged data distribution of currently sampled clients heavily inconsistent with that of all available clients, and further make the aggregated gradient unreliable. To address this issue, we propose a novel learning rate adaptation mechanism to adaptively adjust the server learning rate for the aggregated gradient in each round, according to the consistency between the merged data distribution of currently sampled clients and that of all available clients. Specifically, we make theoretical deductions to find a meaningful and robust indicator that is positively related to the optimal server learning rate and can effectively reflect the merged data distribution of sampled clients, and we utilize it for the server learning rate adaptation. Extensive experiments on multiple image and text classification tasks validate the great effectiveness of our method.  ( 2 min )
    Channel-wise Mixed-precision Assignment for DNN Inference on Constrained Edge Nodes. (arXiv:2206.08852v2 [cs.LG] UPDATED)
    Quantization is widely employed in both cloud and edge systems to reduce the memory occupation, latency, and energy consumption of deep neural networks. In particular, mixed-precision quantization, i.e., the use of different bit-widths for different portions of the network, has been shown to provide excellent efficiency gains with limited accuracy drops, especially with optimized bit-width assignments determined by automated Neural Architecture Search (NAS) tools. State-of-the-art mixed-precision works layer-wise, i.e., it uses different bit-widths for the weights and activations tensors of each network layer. In this work, we widen the search space, proposing a novel NAS that selects the bit-width of each weight tensor channel independently. This gives the tool the additional flexibility of assigning a higher precision only to the weights associated with the most informative features. Testing on the MLPerf Tiny benchmark suite, we obtain a rich collection of Pareto-optimal models in the accuracy vs model size and accuracy vs energy spaces. When deployed on the MPIC RISC-V edge processor, our networks reduce the memory and energy for inference by up to 63% and 27% respectively compared to a layer-wise approach, for the same accuracy.
    Improved Stock Price Movement Classification Using News Articles Based on Embeddings and Label Smoothing. (arXiv:2301.10458v1 [cs.LG])
    Stock price movement prediction is a challenging and essential problem in finance. While it is well established in modern behavioral finance that the share prices of related stocks often move after the release of news via reactions and overreactions of investors, how to capture the relationships between price movements and news articles via quantitative models is an active area research; existing models have achieved success with variable degrees. In this paper, we propose to improve stock price movement classification using news articles by incorporating regularization and optimization techniques from deep learning. More specifically, we capture the dependencies between news articles and stocks through embeddings and bidirectional recurrent neural networks as in recent models. We further incorporate weight decay, batch normalization, dropout, and label smoothing to improve the generalization of the trained models. To handle high fluctuations of validation accuracy of batch normalization, we propose dual-phase training to realize the improvements reliably. Our experimental results on a commonly used dataset show significant improvements, achieving average accuracy of 80.7% on the test set, which is more than 10.0% absolute improvement over existing models. Our ablation studies show batch normalization and label smoothing are most effective, leading to 6.0% and 3.4% absolute improvement, respectively on average.  ( 2 min )
    SGCN: Exploiting Compressed-Sparse Features in Deep Graph Convolutional Network Accelerators. (arXiv:2301.10388v1 [cs.LG])
    Graph convolutional networks (GCNs) are becoming increasingly popular as they overcome the limited applicability of prior neural networks. A GCN takes as input an arbitrarily structured graph and executes a series of layers which exploit the graph's structure to calculate their output features. One recent trend in GCNs is the use of deep network architectures. As opposed to the traditional GCNs which only span around two to five layers deep, modern GCNs now incorporate tens to hundreds of layers with the help of residual connections. From such deep GCNs, we find an important characteristic that they exhibit very high intermediate feature sparsity. We observe that with deep layers and residual connections, the number of zeros in the intermediate features sharply increases. This reveals a new opportunity for accelerators to exploit in GCN executions that was previously not present. In this paper, we propose SGCN, a fast and energy-efficient GCN accelerator which fully exploits the sparse intermediate features of modern GCNs. SGCN suggests several techniques to achieve significantly higher performance and energy efficiency than the existing accelerators. First, SGCN employs a GCN-friendly feature compression format. We focus on reducing the off-chip memory traffic, which often is the bottleneck for GCN executions. Second, we propose microarchitectures for seamlessly handling the compressed feature format. Third, to better handle locality in the existence of the varying sparsity, SGCN employs sparsity-aware cooperation. Sparsity-aware cooperation creates a pattern that exhibits multiple reuse windows, such that the cache can capture diverse sizes of working sets and therefore adapt to the varying level of sparsity. We show that SGCN achieves 1.71x speedup and 43.9% higher energy efficiency compared to the existing accelerators.  ( 2 min )
    Overcoming Prior Misspecification in Online Learning to Rank. (arXiv:2301.10651v1 [cs.LG])
    The recent literature on online learning to rank (LTR) has established the utility of prior knowledge to Bayesian ranking bandit algorithms. However, a major limitation of existing work is the requirement for the prior used by the algorithm to match the true prior. In this paper, we propose and analyze adaptive algorithms that address this issue and additionally extend these results to the linear and generalized linear models. We also consider scalar relevance feedback on top of click feedback. Moreover, we demonstrate the efficacy of our algorithms using both synthetic and real-world experiments.  ( 2 min )
    Banker Online Mirror Descent: A Universal Approach for Delayed Online Bandit Learning. (arXiv:2301.10500v1 [cs.LG])
    We propose `Banker-OMD`, a novel framework generalizing the classical Online Mirror Descent (OMD) technique in the online learning literature. The `Banker-OMD` framework almost completely decouples feedback delay handling and the task-specific OMD algorithm design, thus allowing the easy design of new algorithms capable of easily and robustly handling feedback delays. Specifically, it offers a general methodology for achieving $\tilde{\mathcal O}(\sqrt{T} + \sqrt{D})$-style regret bounds in online bandit learning tasks with delayed feedback, where $T$ is the number of rounds and $D$ is the total feedback delay. We demonstrate the power of \texttt{Banker-OMD} by applications to two important bandit learning scenarios with delayed feedback, including delayed scale-free adversarial Multi-Armed Bandits (MAB) and delayed adversarial linear bandits. `Banker-OMD` leads to the first delayed scale-free adversarial MAB algorithm achieving $\tilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret and the first delayed adversarial linear bandit algorithm achieving $\tilde{\mathcal O}(\text{poly}(n)(\sqrt{T} + \sqrt{D}))$ regret. As a corollary, the first application also implies $\tilde{\mathcal O}(\sqrt{KT}L)$ regret for non-delayed scale-free adversarial MABs, which is the first to match the $\Omega(\sqrt{KT}L)$ lower bound up to logarithmic factors and can be of independent interest.  ( 2 min )
    Integrating Local Real Data with Global Gradient Prototypes for Classifier Re-Balancing in Federated Long-Tailed Learning. (arXiv:2301.10394v1 [cs.LG])
    Federated Learning (FL) has become a popular distributed learning paradigm that involves multiple clients training a global model collaboratively in a data privacy-preserving manner. However, the data samples usually follow a long-tailed distribution in the real world, and FL on the decentralized and long-tailed data yields a poorly-behaved global model severely biased to the head classes with the majority of the training samples. To alleviate this issue, decoupled training has recently been introduced to FL, considering it has achieved promising results in centralized long-tailed learning by re-balancing the biased classifier after the instance-balanced training. However, the current study restricts the capacity of decoupled training in federated long-tailed learning with a sub-optimal classifier re-trained on a set of pseudo features, due to the unavailability of a global balanced dataset in FL. In this work, in order to re-balance the classifier more effectively, we integrate the local real data with the global gradient prototypes to form the local balanced datasets, and thus re-balance the classifier during the local training. Furthermore, we introduce an extra classifier in the training phase to help model the global data distribution, which addresses the problem of contradictory optimization goals caused by performing classifier re-balancing locally. Extensive experiments show that our method consistently outperforms the existing state-of-the-art methods in various settings.  ( 2 min )
    ViDeBERTa: A powerful pre-trained language model for Vietnamese. (arXiv:2301.10439v1 [cs.CL])
    This paper presents ViDeBERTa, a new pre-trained monolingual language model for Vietnamese, with three versions - ViDeBERTa_xsmall, ViDeBERTa_base, and ViDeBERTa_large, which are pre-trained on a large-scale corpus of high-quality and diverse Vietnamese texts using DeBERTa architecture. Although many successful pre-trained language models based on Transformer have been widely proposed for the English language, there are still few pre-trained models for Vietnamese, a low-resource language, that perform good results on downstream tasks, especially Question answering. We fine-tune and evaluate our model on three important natural language downstream tasks, Part-of-speech tagging, Named-entity recognition, and Question answering. The empirical results demonstrate that ViDeBERTa with far fewer parameters surpasses the previous state-of-the-art models on multiple Vietnamese-specific natural language understanding tasks. Notably, ViDeBERTa_base with 86M parameters, which is only about 23% of PhoBERT_large with 370M parameters, still performs the same or better results than the previous state-of-the-art model. Our ViDeBERTa models are available at: https://github.com/HySonLab/ViDeBERTa.  ( 2 min )
    Capacity Analysis of Vector Symbolic Architectures. (arXiv:2301.10352v1 [cs.LG])
    Hyperdimensional computing (HDC) is a biologically-inspired framework that uses high-dimensional vectors and various vector operations to represent and manipulate symbols. The ensemble of a particular vector space and two vector operations (one addition-like for "bundling" and one outer-product-like for "binding") form what is called a "vector symbolic architecture" (VSA). While VSAs have been employed in numerous applications and studied empirically, many theoretical questions about VSAs remain open. We provide theoretical analyses for the *representation capacities* of three popular VSAs: MAP-I, MAP-B, and Binary Sparse. Representation capacity here refers to upper bounds on the dimensions of the VSA vectors required to perform certain symbolic tasks (such as testing for set membership $i \in S$ and estimating set intersection sizes $|S \cap T|$) to a given degree of accuracy. We also describe a relationship between the MAP-I VSA to Hopfield networks, which are simple models of associative memory, and analyze the ability of Hopfield networks to perform some of the same tasks that are typically asked of VSAs. Our analysis of MAP-I casts the VSA vectors as the outputs of *sketching* (dimensionality reduction) algorithms such as the Johnson-Lindenstrauss transform; this provides a clean, simple framework for obtaining bounds on MAP-I's representation capacity. We also provide, to our knowledge, the first analysis of testing set membership in a bundle of general pairwise bindings from MAP-I. Binary sparse VSAs are well-known to be related to Bloom filters; we give analyses of set intersection for Bloom and Counting Bloom filters. Our analysis of MAP-B and Binary Sparse bundling include new applications of several concentration inequalities.  ( 2 min )
    Imitating Human Behaviour with Diffusion Models. (arXiv:2301.10677v1 [cs.AI])
    Diffusion models have emerged as powerful generative models in the text-to-image domain. This paper studies their application as observation-to-action models for imitating human behaviour in sequential environments. Human behaviour is stochastic and multimodal, with structured correlations between action dimensions. Meanwhile, standard modelling choices in behaviour cloning are limited in their expressiveness and may introduce bias into the cloned policy. We begin by pointing out the limitations of these choices. We then propose that diffusion models are an excellent fit for imitating human behaviour, since they learn an expressive distribution over the joint action space. We introduce several innovations to make diffusion models suitable for sequential environments; designing suitable architectures, investigating the role of guidance, and developing reliable sampling strategies. Experimentally, diffusion models closely match human demonstrations in a simulated robotic control task and a modern 3D gaming environment.  ( 2 min )
    HealthEdge: A Machine Learning-Based Smart Healthcare Framework for Prediction of Type 2 Diabetes in an Integrated IoT, Edge, and Cloud Computing System. (arXiv:2301.10450v1 [cs.LG])
    Diabetes Mellitus has no permanent cure to date and is one of the leading causes of death globally. The alarming increase in diabetes calls for the need to take precautionary measures to avoid/predict the occurrence of diabetes. This paper proposes HealthEdge, a machine learning-based smart healthcare framework for type 2 diabetes prediction in an integrated IoT-edge-cloud computing system. Numerical experiments and comparative analysis were carried out between the two most used machine learning algorithms in the literature, Random Forest (RF) and Logistic Regression (LR), using two real-life diabetes datasets. The results show that RF predicts diabetes with 6% more accuracy on average compared to LR.  ( 2 min )
    AutoCost: Evolving Intrinsic Cost for Zero-violation Reinforcement Learning. (arXiv:2301.10339v1 [cs.LG])
    Safety is a critical hurdle that limits the application of deep reinforcement learning (RL) to real-world control tasks. To this end, constrained reinforcement learning leverages cost functions to improve safety in constrained Markov decision processes. However, such constrained RL methods fail to achieve zero violation even when the cost limit is zero. This paper analyzes the reason for such failure, which suggests that a proper cost function plays an important role in constrained RL. Inspired by the analysis, we propose AutoCost, a simple yet effective framework that automatically searches for cost functions that help constrained RL to achieve zero-violation performance. We validate the proposed method and the searched cost function on the safe RL benchmark Safety Gym. We compare the performance of augmented agents that use our cost function to provide additive intrinsic costs with baseline agents that use the same policy learners but with only extrinsic costs. Results show that the converged policies with intrinsic costs in all environments achieve zero constraint violation and comparable performance with baselines.  ( 2 min )
    A Data-Centric Approach for Improving Adversarial Training Through the Lens of Out-of-Distribution Detection. (arXiv:2301.10454v1 [cs.LG])
    Current machine learning models achieve super-human performance in many real-world applications. Still, they are susceptible against imperceptible adversarial perturbations. The most effective solution for this problem is adversarial training that trains the model with adversarially perturbed samples instead of original ones. Various methods have been developed over recent years to improve adversarial training such as data augmentation or modifying training attacks. In this work, we examine the same problem from a new data-centric perspective. For this purpose, we first demonstrate that the existing model-based methods can be equivalent to applying smaller perturbation or optimization weights to the hard training examples. By using this finding, we propose detecting and removing these hard samples directly from the training procedure rather than applying complicated algorithms to mitigate their effects. For detection, we use maximum softmax probability as an effective method in out-of-distribution detection since we can consider the hard samples as the out-of-distribution samples for the whole data distribution. Our results on SVHN and CIFAR-10 datasets show the effectiveness of this method in improving the adversarial training without adding too much computational cost.  ( 2 min )
    Near-Optimal No-Regret Learning in General Games. (arXiv:2108.06924v2 [cs.LG] UPDATED)
    We show that Optimistic Hedge -- a common variant of multiplicative-weights-updates with recency bias -- attains ${\rm poly}(\log T)$ regret in multi-player general-sum games. In particular, when every player of the game uses Optimistic Hedge to iteratively update her strategy in response to the history of play so far, then after $T$ rounds of interaction, each player experiences total regret that is ${\rm poly}(\log T)$. Our bound improves, exponentially, the $O({T}^{1/2})$ regret attainable by standard no-regret learners in games, the $O(T^{1/4})$ regret attainable by no-regret learners with recency bias (Syrgkanis et al., 2015), and the ${O}(T^{1/6})$ bound that was recently shown for Optimistic Hedge in the special case of two-player games (Chen & Pen, 2020). A corollary of our bound is that Optimistic Hedge converges to coarse correlated equilibrium in general games at a rate of $\tilde{O}\left(\frac 1T\right)$.
    A Provable Splitting Approach for Symmetric Nonnegative Matrix Factorization. (arXiv:2301.10499v1 [cs.LG])
    The symmetric Nonnegative Matrix Factorization (NMF), a special but important class of the general NMF, has found numerous applications in data analysis such as various clustering tasks. Unfortunately, designing fast algorithms for the symmetric NMF is not as easy as for its nonsymmetric counterpart, since the latter admits the splitting property that allows state-of-the-art alternating-type algorithms. To overcome this issue, we first split the decision variable and transform the symmetric NMF to a penalized nonsymmetric one, paving the way for designing efficient alternating-type algorithms. We then show that solving the penalized nonsymmetric reformulation returns a solution to the original symmetric NMF. Moreover, we design a family of alternating-type algorithms and show that they all admit strong convergence guarantee: the generated sequence of iterates is convergent and converges at least sublinearly to a critical point of the original symmetric NMF. Finally, we conduct experiments on both synthetic data and real image clustering to support our theoretical results and demonstrate the performance of the alternating-type algorithms.  ( 2 min )
    RDIS: Random Drop Imputation with Self-Training for Incomplete Time Series Data. (arXiv:2010.10075v2 [cs.LG] UPDATED)
    Time-series data with missing values are commonly encountered in many fields, such as healthcare, meteorology, and robotics. The imputation aims to fill the missing values with valid values. Most imputation methods trained the models implicitly because missing values have no ground truth. In this paper, we propose Random Drop Imputation with Self-training (RDIS), a novel training method for time-series data imputation models. In RDIS, we generate extra missing values by applying a random drop on the observed values in incomplete data. We can explicitly train the imputation models by filling in the randomly dropped values. In addition, we adopt self-training with pseudo values to exploit the original missing values. To improve the quality of pseudo values, we set the threshold and filter them by calculating the entropy. To verify the effectiveness of RDIS on the time series imputation, we test RDIS to various imputation models and achieve competitive results on two real-world datasets.
    ARDIAS: AI-Enhanced Research Management, Discovery, and Advisory System. (arXiv:2301.10577v1 [cs.CL])
    In this work, we present ARDIAS, a web-based application that aims to provide researchers with a full suite of discovery and collaboration tools. ARDIAS currently allows searching for authors and articles by name and gaining insights into the research topics of a particular researcher. With the aid of AI-based tools, ARDIAS aims to recommend potential collaborators and topics to researchers. In the near future, we aim to add tools that allow researchers to communicate with each other and start new projects.
    Multimodal Analogical Reasoning over Knowledge Graphs. (arXiv:2210.00312v3 [cs.CL] UPDATED)
    Analogical reasoning is fundamental to human cognition and holds an important place in various fields. However, previous studies mainly focus on single-modal analogical reasoning and ignore taking advantage of structure knowledge. Notably, the research in cognitive psychology has demonstrated that information from multimodal sources always brings more powerful cognitive transfer than single modality sources. To this end, we introduce the new task of multimodal analogical reasoning over knowledge graphs, which requires multimodal reasoning ability with the help of background knowledge. Specifically, we construct a Multimodal Analogical Reasoning dataSet (MARS) and a multimodal knowledge graph MarKG. We evaluate with multimodal knowledge graph embedding and pre-trained Transformer baselines, illustrating the potential challenges of the proposed task. We further propose a novel model-agnostic Multimodal analogical reasoning framework with Transformer (MarT) motivated by the structure mapping theory, which can obtain better performance. Code and datasets are available in https://github.com/zjunlp/MKG_Analogy.
    OCD: Learning to Overfit with Conditional Diffusion Models. (arXiv:2210.00471v4 [cs.LG] UPDATED)
    We present a dynamic model in which the weights are conditioned on an input sample x and are learned to match those that would be obtained by finetuning a base model on x and its label y. This mapping between an input sample and network weights is approximated by a denoising diffusion model. The diffusion model we employ focuses on modifying a single layer of the base model and is conditioned on the input, activations, and output of this layer. Since the diffusion model is stochastic in nature, multiple initializations generate different networks, forming an ensemble, which leads to further improvements. Our experiments demonstrate the wide applicability of the method for image classification, 3D reconstruction, tabular data, speech separation, and natural language processing. Our code is available at https://github.com/ShaharLutatiPersonal/OCD
    Batch Bayesian Optimization on Permutations using the Acquisition Weighted Kernel. (arXiv:2102.13382v2 [stat.ML] UPDATED)
    In this work we propose a batch Bayesian optimization method for combinatorial problems on permutations, which is well suited for expensive-to-evaluate objectives. We first introduce LAW, an efficient batch acquisition method based on determinantal point processes using the acquisition weighted kernel. Relying on multiple parallel evaluations, LAW enables accelerated search on combinatorial spaces. We then apply the framework to permutation problems, which have so far received little attention in the Bayesian Optimization literature, despite their practical importance. We call this method LAW2ORDER. On the theoretical front, we prove that LAW2ORDER has vanishing simple regret by showing that the batch cumulative regret is sublinear. Empirically, we assess the method on several standard combinatorial problems involving permutations such as quadratic assignment, flowshop scheduling and the traveling salesman, as well as on a structure learning task.
    A blob method for inhomogeneous diffusion with applications to multi-agent control and sampling. (arXiv:2202.12927v3 [math.AP] UPDATED)
    As a counterpoint to classical stochastic particle methods for linear diffusion equations, we develop a deterministic particle method for the weighted porous medium equation (WPME) and prove its convergence on bounded time intervals. This generalizes related work on blob methods for unweighted porous medium equations. From a numerical analysis perspective, our method has several advantages: it is meshfree, preserves the gradient flow structure of the underlying PDE, converges in arbitrary dimension, and captures the correct asymptotic behavior in simulations. That our method succeeds in capturing the long time behavior of WPME is significant from the perspective of related problems in quantization. Just as the Fokker-Planck equation provides a way to quantize a probability measure $\bar{\rho}$ by evolving an empirical measure according to stochastic Langevin dynamics so that the empirical measure flows toward $\bar{\rho}$, our particle method provides a way to quantize $\bar{\rho}$ according to deterministic particle dynamics approximating WMPE. In this way, our method has natural applications to multi-agent coverage algorithms and sampling probability measures. A specific case of our method corresponds exactly to confined mean-field dynamics of training a two-layer neural network for a radial basis function activation function. From this perspective, our convergence result shows that, in the overparametrized regime and as the variance of the radial basis functions goes to zero, the continuum limit is given by WPME. This generalizes previous results, which considered the case of a uniform data distribution, to the more general inhomogeneous setting. As a consequence of our convergence result, we identify conditions on the target function and data distribution for which convexity of the energy landscape emerges in the continuum limit.
    Learning to Rank Normalized Entropy Curves with Differentiable Window Transformation. (arXiv:2301.10443v1 [cs.LG])
    Recent automated machine learning systems often use learning curves ranking models to inform decisions about when to stop unpromising trials and identify better model configurations. In this paper, we present a novel learning curve ranking model specifically tailored for ranking normalized entropy (NE) learning curves, which are commonly used in online advertising and recommendation systems. Our proposed model, self-Adaptive Curve Transformation augmented Relative curve Ranking (ACTR2), features an adaptive curve transformation layer that transforms raw lifetime NE curves into composite window NE curves with the window sizes adaptively optimized based on both the position on the learning curve and the curve's dynamics. We also introduce a novel differentiable indexing method for the proposed adaptive curve transformation, which allows gradients with respect to the discrete indices to flow freely through the curve transformation layer, enabling the learned window sizes to be updated flexibly during training. Additionally, we propose a pairwise curve ranking architecture that directly models the difference between the two learning curves and is better at capturing subtle changes in relative performance that may not be evident when modeling each curve individually as the existing approaches did. Our extensive experiments on a real-world NE curve dataset demonstrate the effectiveness of each key component of ACTR2 and its improved performance over the state-of-the-art.
    MLPGradientFlow: going with the flow of multilayer perceptrons (and finding minima fast and accurately). (arXiv:2301.10638v1 [cs.LG])
    MLPGradientFlow is a software package to solve numerically the gradient flow differential equation $\dot \theta = -\nabla \mathcal L(\theta; \mathcal D)$, where $\theta$ are the parameters of a multi-layer perceptron, $\mathcal D$ is some data set, and $\nabla \mathcal L$ is the gradient of a loss function. We show numerically that adaptive first- or higher-order integration methods based on Runge-Kutta schemes have better accuracy and convergence speed than gradient descent with the Adam optimizer. However, we find Newton's method and approximations like BFGS preferable to find fixed points (local and global minima of $\mathcal L$) efficiently and accurately. For small networks and data sets, gradients are usually computed faster than in pytorch and Hessian are computed at least $5\times$ faster. Additionally, the package features an integrator for a teacher-student setup with bias-free, two-layer networks trained with standard Gaussian input in the limit of infinite data. The code is accessible at https://github.com/jbrea/MLPGradientFlow.jl.  ( 2 min )
    Understanding and Improving Deep Graph Neural Networks: A Probabilistic Graphical Model Perspective. (arXiv:2301.10536v1 [cs.LG])
    Recently, graph-based models designed for downstream tasks have significantly advanced research on graph neural networks (GNNs). GNN baselines based on neural message-passing mechanisms such as GCN and GAT perform worse as the network deepens. Therefore, numerous GNN variants have been proposed to tackle this performance degradation problem, including many deep GNNs. However, a unified framework is still lacking to connect these existing models and interpret their effectiveness at a high level. In this work, we focus on deep GNNs and propose a novel view for understanding them. We establish a theoretical framework via inference on a probabilistic graphical model. Given the fixed point equation (FPE) derived from the variational inference on the Markov random fields, the deep GNNs, including JKNet, GCNII, DGCN, and the classical GNNs, such as GCN, GAT, and APPNP, can be regarded as different approximations of the FPE. Moreover, given this framework, more accurate approximations of FPE are brought, guiding us to design a more powerful GNN: coupling graph neural network (CoGNet). Extensive experiments are carried out on citation networks and natural language processing downstream tasks. The results demonstrate that the CoGNet outperforms the SOTA models.  ( 2 min )
    DEJA VU: Continual Model Generalization For Unseen Domains. (arXiv:2301.10418v1 [cs.LG])
    In real-world applications, deep learning models often run in non-stationary environments where the target data distribution continually shifts over time. There have been numerous domain adaptation (DA) methods in both online and offline modes to improve cross-domain adaptation ability. However, these DA methods typically only provide good performance after a long period of adaptation, and perform poorly on new domains before and during adaptation - in what we call the "Unfamiliar Period", especially when domain shifts happen suddenly and significantly. On the other hand, domain generalization (DG) methods have been proposed to improve the model generalization ability on unadapted domains. However, existing DG works are ineffective for continually changing domains due to severe catastrophic forgetting of learned knowledge. To overcome these limitations of DA and DG in handling the Unfamiliar Period during continual domain shift, we propose RaTP, a framework that focuses on improving models' target domain generalization (TDG) capability, while also achieving effective target domain adaptation (TDA) capability right after training on certain domains and forgetting alleviation (FA) capability on past domains. RaTP includes a training-free data augmentation module to prepare data for TDG, a novel pseudo-labeling mechanism to provide reliable supervision for TDA, and a prototype contrastive alignment algorithm to align different domains for achieving TDG, TDA and FA. Extensive experiments on Digits, PACS, and DomainNet demonstrate that RaTP significantly outperforms state-of-the-art works from Continual DA, Source-Free DA, Test-Time/Online DA, Single DG, Multiple DG and Unified DA&DG in TDG, and achieves comparable TDA and FA capabilities.  ( 2 min )
    Off-Policy Evaluation for Action-Dependent Non-Stationary Environments. (arXiv:2301.10330v1 [cs.LG])
    Methods for sequential decision-making are often built upon a foundational assumption that the underlying decision process is stationary. This limits the application of such methods because real-world problems are often subject to changes due to external factors (passive non-stationarity), changes induced by interactions with the system itself (active non-stationarity), or both (hybrid non-stationarity). In this work, we take the first steps towards the fundamental challenge of on-policy and off-policy evaluation amidst structured changes due to active, passive, or hybrid non-stationarity. Towards this goal, we make a higher-order stationarity assumption such that non-stationarity results in changes over time, but the way changes happen is fixed. We propose, OPEN, an algorithm that uses a double application of counterfactual reasoning and a novel importance-weighted instrument-variable regression to obtain both a lower bias and a lower variance estimate of the structure in the changes of a policy's past performances. Finally, we show promising results on how OPEN can be used to predict future performances for several domains inspired by real-world applications that exhibit non-stationarity.  ( 2 min )
    Exact Fractional Inference via Re-Parametrization & Interpolation between Tree-Re-Weighted- and Belief Propagation- Algorithms. (arXiv:2301.10369v1 [cs.LG])
    Inference efforts -- required to compute partition function, $Z$, of an Ising model over a graph of $N$ ``spins" -- are most likely exponential in $N$. Efficient variational methods, such as Belief Propagation (BP) and Tree Re-Weighted (TRW) algorithms, compute $Z$ approximately minimizing respective (BP- or TRW-) free energy. We generalize the variational scheme building a $\lambda$-fractional-homotopy, $Z^{(\lambda)}$, where $\lambda=0$ and $\lambda=1$ correspond to TRW- and BP-approximations, respectively, and $Z^{(\lambda)}$ decreases with $\lambda$ monotonically. Moreover, this fractional scheme guarantees that in the attractive (ferromagnetic) case $Z^{(TRW)}\geq Z^{(\lambda)}\geq Z^{(BP)}$, and there exists a unique (``exact") $\lambda_*$ such that, $Z=Z^{(\lambda_*)}$. Generalizing the re-parametrization approach of \cite{wainwright_tree-based_2002} and the loop series approach of \cite{chertkov_loop_2006}, we show how to express $Z$ as a product, $\forall \lambda:\ Z=Z^{(\lambda)}{\cal Z}^{(\lambda)}$, where the multiplicative correction, ${\cal Z}^{(\lambda)}$, is an expectation over a node-independent probability distribution built from node-wise fractional marginals. Our theoretical analysis is complemented by extensive experiments with models from Ising ensembles over planar and random graphs of medium- and large- sizes. The empirical study yields a number of interesting observations, such as (a) ability to estimate ${\cal Z}^{(\lambda)}$ with $O(N^4)$ fractional samples; (b) suppression of $\lambda_*$ fluctuations with increase in $N$ for instances from a particular random Ising ensemble.
    On Batching Variable Size Inputs for Training End-to-End Speech Enhancement Systems. (arXiv:2301.10587v1 [cs.SD])
    The performance of neural network-based speech enhancement systems is primarily influenced by the model architecture, whereas training times and computational resource utilization are primarily affected by training parameters such as the batch size. Since noisy and reverberant speech mixtures can have different duration, a batching strategy is required to handle variable size inputs during training, in particular for state-of-the-art end-to-end systems. Such strategies usually strive a compromise between zero-padding and data randomization, and can be combined with a dynamic batch size for a more consistent amount of data in each batch. However, the effect of these practices on resource utilization and more importantly network performance is not well documented. This paper is an empirical study of the effect of different batching strategies and batch sizes on the training statistics and speech enhancement performance of a Conv-TasNet, evaluated in both matched and mismatched conditions. We find that using a small batch size during training improves performance in both conditions for all batching strategies. Moreover, using sorted or bucket batching with a dynamic batch size allows for reduced training time and GPU memory usage while achieving similar performance compared to random batching with a fixed batch size.
    Audience-Centric Natural Language Generation via Style Infusion. (arXiv:2301.10283v1 [cs.CL])
    Adopting contextually appropriate, audience-tailored linguistic styles is critical to the success of user-centric language generation systems (e.g., chatbots, computer-aided writing, dialog systems). While existing approaches demonstrate textual style transfer with large volumes of parallel or non-parallel data, we argue that grounding style on audience-independent external factors is innately limiting for two reasons. First, it is difficult to collect large volumes of audience-specific stylistic data. Second, some stylistic objectives (e.g., persuasiveness, memorability, empathy) are hard to define without audience feedback. In this paper, we propose the novel task of style infusion - infusing the stylistic preferences of audiences in pretrained language generation models. Since humans are better at pairwise comparisons than direct scoring - i.e., is Sample-A more persuasive/polite/empathic than Sample-B - we leverage limited pairwise human judgments to bootstrap a style analysis model and augment our seed set of judgments. We then infuse the learned textual style in a GPT-2 based text generator while balancing fluency and style adoption. With quantitative and qualitative assessments, we show that our infusion approach can generate compelling stylized examples with generic text prompts. The code and data are accessible at https://github.com/CrowdDynamicsLab/StyleInfusion.  ( 2 min )
    NIAPU: network-informed adaptive positive-unlabeled learning for disease gene identification. (arXiv:2108.06158v4 [cs.LG] UPDATED)
    Gene-disease associations are fundamental for understanding disease etiology and developing effective interventions and treatments. Identifying genes not yet associated with a disease due to a lack of studies is a challenging task in which prioritization based on prior knowledge is an important element. The computational search for new candidate disease genes may be eased by positive-unlabeled learning, the machine learning setting in which only a subset of instances are labeled as positive while the rest of the data set is unlabeled. In this work, we propose a set of effective network-based features to be used in a novel Markov diffusion-based multi-class labeling strategy for putative disease gene discovery. The performances of the new labeling algorithm and the effectiveness of the proposed features have been tested on ten different disease data sets using three machine learning algorithms. The new features have been compared against classical topological and functional/ontological features and a set of network- and biological-derived features already used in gene discovery tasks. The predictive power of the integrated methodology in searching for new disease genes has been found to be competitive against state-of-the-art algorithms.
    Voint Cloud: Multi-View Point Cloud Representation for 3D Understanding. (arXiv:2111.15363v2 [cs.CV] UPDATED)
    Multi-view projection methods have demonstrated promising performance on 3D understanding tasks like 3D classification and segmentation. However, it remains unclear how to combine such multi-view methods with the widely available 3D point clouds. Previous methods use unlearned heuristics to combine features at the point level. To this end, we introduce the concept of the multi-view point cloud (Voint cloud), representing each 3D point as a set of features extracted from several view-points. This novel 3D Voint cloud representation combines the compactness of 3D point cloud representation with the natural view-awareness of multi-view representation. Naturally, we can equip this new representation with convolutional and pooling operations. We deploy a Voint neural network (VointNet) to learn representations in the Voint space. Our novel representation achieves \sota performance on 3D classification, shape retrieval, and robust 3D part segmentation on standard benchmarks ( ScanObjectNN, ShapeNet Core55, and ShapeNet Parts).
    Lightweight Neural Architecture Search for Temporal Convolutional Networks at the Edge. (arXiv:2301.10281v1 [cs.LG])
    Neural Architecture Search (NAS) is quickly becoming the go-to approach to optimize the structure of Deep Learning (DL) models for complex tasks such as Image Classification or Object Detection. However, many other relevant applications of DL, especially at the edge, are based on time-series processing and require models with unique features, for which NAS is less explored. This work focuses in particular on Temporal Convolutional Networks (TCNs), a convolutional model for time-series processing that has recently emerged as a promising alternative to more complex recurrent architectures. We propose the first NAS tool that explicitly targets the optimization of the most peculiar architectural parameters of TCNs, namely dilation, receptive-field and number of features in each layer. The proposed approach searches for networks that offer good trade-offs between accuracy and number of parameters/operations, enabling an efficient deployment on embedded platforms. We test the proposed NAS on four real-world, edge-relevant tasks, involving audio and bio-signals. Results show that, starting from a single seed network, our method is capable of obtaining a rich collection of Pareto optimal architectures, among which we obtain models with the same accuracy as the seed, and 15.9-152x fewer parameters. Compared to three state-of-the-art NAS tools, ProxylessNAS, MorphNet and FBNetV2, our method explores a larger search space for TCNs (up to 10^12x) and obtains superior solutions, while requiring low GPU memory and search time. We deploy our NAS outputs on two distinct edge devices, the multicore GreenWaves Technology GAP8 IoT processor and the single-core STMicroelectronics STM32H7 microcontroller. With respect to the state-of-the-art hand-tuned models, we reduce latency and energy of up to 5.5x and 3.8x on the two targets respectively, without any accuracy loss.  ( 3 min )
    Learning Dynamical Systems from Data: A Simple Cross-Validation Perspective, Part V: Sparse Kernel Flows for 132 Chaotic Dynamical Systems. (arXiv:2301.10321v1 [stat.ML])
    Regressing the vector field of a dynamical system from a finite number of observed states is a natural way to learn surrogate models for such systems. A simple and interpretable way to learn a dynamical system from data is to interpolate its vector-field with a data-adapted kernel which can be learned by using Kernel Flows. The method of Kernel Flows is a trainable machine learning method that learns the optimal parameters of a kernel based on the premise that a kernel is good if there is no significant loss in accuracy if half of the data is used. The objective function could be a short-term prediction or some other objective for other variants of Kernel Flows). However, this method is limited by the choice of the base kernel. In this paper, we introduce the method of \emph{Sparse Kernel Flows } in order to learn the ``best'' kernel by starting from a large dictionary of kernels. It is based on sparsifying a kernel that is a linear combination of elemental kernels. We apply this approach to a library of 132 chaotic systems.  ( 2 min )
    Pre-computed memory or on-the-fly encoding? A hybrid approach to retrieval augmentation makes the most of your compute. (arXiv:2301.10448v1 [cs.CL])
    Retrieval-augmented language models such as Fusion-in-Decoder are powerful, setting the state of the art on a variety of knowledge-intensive tasks. However, they are also expensive, due to the need to encode a large number of retrieved passages. Some work avoids this cost by pre-encoding a text corpus into a memory and retrieving dense representations directly. However, pre-encoding memory incurs a severe quality penalty as the memory representations are not conditioned on the current input. We propose LUMEN, a hybrid between these two extremes, pre-computing the majority of the retrieval representation and completing the encoding on the fly using a live encoder that is conditioned on the question and fine-tuned for the task. We show that LUMEN significantly outperforms pure memory on multiple question-answering tasks while being much cheaper than FiD, and outperforms both for any given compute budget. Moreover, the advantage of LUMEN over FiD increases with model size.  ( 2 min )
    One Model for All Domains: Collaborative Domain-Prefix Tuning for Cross-Domain NER. (arXiv:2301.10410v1 [cs.CL])
    Cross-domain NER is a challenging task to address the low-resource problem in practical scenarios. Previous typical solutions mainly obtain a NER model by pre-trained language models (PLMs) with data from a rich-resource domain and adapt it to the target domain. Owing to the mismatch issue among entity types in different domains, previous approaches normally tune all parameters of PLMs, ending up with an entirely new NER model for each domain. Moreover, current models only focus on leveraging knowledge in one general source domain while failing to successfully transfer knowledge from multiple sources to the target. To address these issues, we introduce Collaborative Domain-Prefix Tuning for cross-domain NER (CP-NER) based on text-to-text generative PLMs. Specifically, we present text-to-text generation grounding domain-related instructors to transfer knowledge to new domain NER tasks without structural modifications. We utilize frozen PLMs and conduct collaborative domain-prefix tuning to stimulate the potential of PLMs to handle NER tasks across various domains. Experimental results on the Cross-NER benchmark show that the proposed approach has flexible transfer ability and performs better on both one-source and multiple-source cross-domain NER tasks. Codes will be available in https://github.com/zjunlp/DeepKE/tree/main/example/ner/cross.  ( 2 min )
    Editing Language Model-based Knowledge Graph Embeddings. (arXiv:2301.10405v1 [cs.CL])
    Recently decades have witnessed the empirical success of framing Knowledge Graph (KG) embeddings via language models. However, language model-based KG embeddings are usually deployed as static artifacts, which are challenging to modify without re-training after deployment. To address this issue, we propose a new task of editing language model-based KG embeddings in this paper. The proposed task aims to enable data-efficient and fast updates to KG embeddings without damaging the performance of the rest. We build four new datasets: E-FB15k237, A-FB15k237, E-WN18RR, and A-WN18RR, and evaluate several knowledge editing baselines demonstrating the limited ability of previous models to handle the proposed challenging task. We further propose a simple yet strong baseline dubbed KGEditor, which utilizes additional parametric layers of the hyper network to edit/add facts. Comprehensive experimental results demonstrate that KGEditor can perform better when updating specific facts while not affecting the rest with low training resources. Code and datasets will be available in https://github.com/zjunlp/PromptKG/tree/main/deltaKG.  ( 2 min )
    Designing Data: Proactive Data Collection and Iteration for Machine Learning. (arXiv:2301.10319v1 [cs.HC])
    Lack of diversity in data collection has caused significant failures in machine learning (ML) applications. While ML developers perform post-collection interventions, these are time intensive and rarely comprehensive. Thus, new methods to track and manage data collection, iteration, and model training are necessary for evaluating whether datasets reflect real world variability. We present designing data, an iterative, bias mitigating approach to data collection connecting HCI concepts with ML techniques. Our process includes (1) Pre-Collection Planning, to reflexively prompt and document expected data distributions; (2) Collection Monitoring, to systematically encourage sampling diversity; and (3) Data Familiarity, to identify samples that are unfamiliar to a model through Out-of-Distribution (OOD) methods. We instantiate designing data through our own data collection and applied ML case study. We find models trained on "designed" datasets generalize better across intersectional groups than those trained on similarly sized but less targeted datasets, and that data familiarity is effective for debugging datasets.  ( 2 min )
    Weakly Supervised Headline Dependency Parsing. (arXiv:2301.10371v1 [cs.CL])
    English news headlines form a register with unique syntactic properties that have been documented in linguistics literature since the 1930s. However, headlines have received surprisingly little attention from the NLP syntactic parsing community. We aim to bridge this gap by providing the first news headline corpus of Universal Dependencies annotated syntactic dependency trees, which enables us to evaluate existing state-of-the-art dependency parsers on news headlines. To improve English news headline parsing accuracies, we develop a projection method to bootstrap silver training data from unlabeled news headline-article lead sentence pairs. Models trained on silver headline parses demonstrate significant improvements in performance over models trained solely on gold-annotated long-form texts. Ultimately, we find that, although projected silver training data improves parser performance across different news outlets, the improvement is moderated by constructions idiosyncratic to outlet.  ( 2 min )
    Exact and rapid linear clustering of networks with dynamic programming. (arXiv:2301.10403v1 [cs.SI])
    We study the problem of clustering networks whose nodes have imputed or physical positions in a single dimension, such as prestige hierarchies or the similarity dimension of hyperbolic embeddings. Existing algorithms, such as the critical gap method and other greedy strategies, only offer approximate solutions. Here, we introduce a dynamic programming approach that returns provably optimal solutions in polynomial time -- O(n^2) steps -- for a broad class of clustering objectives. We demonstrate the algorithm through applications to synthetic and empirical networks, and show that it outperforms existing heuristics by a significant margin, with a similar execution time.  ( 2 min )
    Predicting mental health using social media: A roadmap for future development. (arXiv:2301.10453v1 [cs.IR])
    Mental disorders such as depression and suicidal ideation are hazardous, affecting more than 300 million people over the world. However, on social media, mental disorder symptoms can be observed, and automated approaches are increasingly capable of detecting them. The considerable number of social media users and the tremendous quantity of user-generated data on social platforms provide a unique opportunity for researchers to distinguish patterns that correlate with mental status. This research offers a roadmap for analysis, where mental state detection can be based on machine learning techniques. We describe the common approaches for predicting and identifying the disorder using user-generated content. This research is organized according to the data collection, feature extraction, and prediction algorithms. Furthermore, we review several recent studies conducted to explore different features of candidate profiles and their analytical methods. Following, we debate various aspects of the development of experimental auto-detection frameworks for identifying users who suffer from disorders, and we conclude with a discussion of future trends. The introduced methods can help complement screening procedures, identify at-risk people through social media monitoring on a large scale, and make disorders easier to treat in the future.  ( 2 min )
    Parameterizing the cost function of Dynamic Time Warping with application to time series classification. (arXiv:2301.10350v1 [cs.LG])
    Dynamic Time Warping (DTW) is a popular time series distance measure that aligns the points in two series with one another. These alignments support warping of the time dimension to allow for processes that unfold at differing rates. The distance is the minimum sum of costs of the resulting alignments over any allowable warping of the time dimension. The cost of an alignment of two points is a function of the difference in the values of those points. The original cost function was the absolute value of this difference. Other cost functions have been proposed. A popular alternative is the square of the difference. However, to our knowledge, this is the first investigation of both the relative impacts of using different cost functions and the potential to tune cost functions to different tasks. We do so in this paper by using a tunable cost function {\lambda}{\gamma} with parameter {\gamma}. We show that higher values of {\gamma} place greater weight on larger pairwise differences, while lower values place greater weight on smaller pairwise differences. We demonstrate that training {\gamma} significantly improves the accuracy of both the DTW nearest neighbor and Proximity Forest classifiers.  ( 2 min )
    Data Consistent Deep Rigid MRI Motion Correction. (arXiv:2301.10365v1 [eess.IV])
    Motion artifacts are a pervasive problem in MRI, leading to misdiagnosis or mischaracterization in population-level imaging studies. Current retrospective rigid intra-slice motion correction techniques jointly optimize estimates of the image and the motion parameters. In this paper, we use a deep network to reduce the joint image-motion parameter search to a search over rigid motion parameters alone. Our network produces a reconstruction as a function of two inputs: corrupted k-space data and motion parameters. We train the network using simulated, motion-corrupted k-space data generated from known motion parameters. At test-time, we estimate unknown motion parameters by minimizing a data consistency loss between the motion parameters, the network-based image reconstruction given those parameters, and the acquired measurements. Intra-slice motion correction experiments on simulated and realistic 2D fast spin echo brain MRI achieve high reconstruction fidelity while retaining the benefits of explicit data consistency-based optimization. Our code is publicly available at https://www.github.com/nalinimsingh/neuroMoCo.  ( 2 min )
    Multilingual Multiaccented Multispeaker TTS with RADTTS. (arXiv:2301.10335v1 [cs.SD])
    We work to create a multilingual speech synthesis system which can generate speech with the proper accent while retaining the characteristics of an individual voice. This is challenging to do because it is expensive to obtain bilingual training data in multiple languages, and the lack of such data results in strong correlations that entangle speaker, language, and accent, resulting in poor transfer capabilities. To overcome this, we present a multilingual, multiaccented, multispeaker speech synthesis model based on RADTTS with explicit control over accent, language, speaker and fine-grained $F_0$ and energy features. Our proposed model does not rely on bilingual training data. We demonstrate an ability to control synthesized accent for any speaker in an open-source dataset comprising of 7 accents. Human subjective evaluation demonstrates that our model can better retain a speaker's voice and accent quality than controlled baselines while synthesizing fluent speech in all target languages and accents in our dataset.  ( 2 min )
    Generating Multidimensional Clusters With Support Lines. (arXiv:2301.10327v1 [cs.LG])
    Synthetic data is essential for assessing clustering techniques, complementing and extending real data, and allowing for a more complete coverage of a given problem's space. In turn, synthetic data generators have the potential of creating vast amounts of data -- a crucial activity when real-world data is at premium -- while providing a well-understood generation procedure and an interpretable instrument for methodically investigating cluster analysis algorithms. Here, we present \textit{Clugen}, a modular procedure for synthetic data generation, capable of creating multidimensional clusters supported by line segments using arbitrary distributions. \textit{Clugen} is open source, 100\% unit tested and fully documented, and is available for the Python, R, Julia and MATLAB/Octave ecosystems. We demonstrate that our proposal is able to produce rich and varied results in various dimensions, is fit for use in the assessment of clustering algorithms, and has the potential to be a widely used framework in diverse clustering-related research tasks.  ( 2 min )
    Score Matching via Differentiable Physics. (arXiv:2301.10250v1 [cs.LG])
    Diffusion models based on stochastic differential equations (SDEs) gradually perturb a data distribution $p(\mathbf{x})$ over time by adding noise to it. A neural network is trained to approximate the score $\nabla_\mathbf{x} \log p_t(\mathbf{x})$ at time $t$, which can be used to reverse the corruption process. In this paper, we focus on learning the score field that is associated with the time evolution according to a physics operator in the presence of natural non-deterministic physical processes like diffusion. A decisive difference to previous methods is that the SDE underlying our approach transforms the state of a physical system to another state at a later time. For that purpose, we replace the drift of the underlying SDE formulation with a differentiable simulator or a neural network approximation of the physics. We propose different training strategies based on the so-called probability flow ODE to fit a training set of simulation trajectories and discuss their relation to the score matching objective. For inference, we sample plausible trajectories that evolve towards a given end state using the reverse-time SDE and demonstrate the competitiveness of our approach for different challenging inverse problems.  ( 2 min )
    Interactive-Chain-Prompting: Ambiguity Resolution for Crosslingual Conditional Generation with Interaction. (arXiv:2301.10309v1 [cs.LG])
    Crosslingual conditional generation (e.g., machine translation) has long enjoyed the benefits of scaling. Nonetheless, there are still issues that scale alone may not overcome. A source query in one language, for instance, may yield several translation options in another language without any extra context. Only one translation could be acceptable however, depending on the translator's preferences and goals. Choosing the incorrect option might significantly affect translation usefulness and quality. We propose a novel method interactive-chain prompting -- a series of question, answering and generation intermediate steps between a Translator model and a User model -- that reduces translations into a list of subproblems addressing ambiguities and then resolving such subproblems before producing the final text to be translated. To check ambiguity resolution capabilities and evaluate translation quality, we create a dataset exhibiting different linguistic phenomena which leads to ambiguities at inference for four languages. To encourage further exploration in this direction, we release all datasets. We note that interactive-chain prompting, using eight interactions as exemplars, consistently surpasses prompt-based methods with direct access to background information to resolve ambiguities.  ( 2 min )
    Towards Robust Metrics for Concept Representation Evaluation. (arXiv:2301.10367v1 [cs.LG])
    Recent work on interpretability has focused on concept-based explanations, where deep learning models are explained in terms of high-level units of information, referred to as concepts. Concept learning models, however, have been shown to be prone to encoding impurities in their representations, failing to fully capture meaningful features of their inputs. While concept learning lacks metrics to measure such phenomena, the field of disentanglement learning has explored the related notion of underlying factors of variation in the data, with plenty of metrics to measure the purity of such factors. In this paper, we show that such metrics are not appropriate for concept learning and propose novel metrics for evaluating the purity of concept representations in both approaches. We show the advantage of these metrics over existing ones and demonstrate their utility in evaluating the robustness of concept representations and interventions performed on them. In addition, we show their utility for benchmarking state-of-the-art methods from both families and find that, contrary to common assumptions, supervision alone may not be sufficient for pure concept representations.  ( 2 min )
    Language Model Detoxification in Dialogue with Contextualized Stance Control. (arXiv:2301.10368v1 [cs.CL])
    To reduce the toxic degeneration in a pretrained Language Model (LM), previous work on Language Model detoxification has focused on reducing the toxicity of the generation itself (self-toxicity) without consideration of the context. As a result, a type of implicit offensive language where the generations support the offensive language in the context is ignored. Different from the LM controlling tasks in previous work, where the desired attributes are fixed for generation, the desired stance of the generation depends on the offensiveness of the context. Therefore, we propose a novel control method to do context-dependent detoxification with the stance taken into consideration. We introduce meta prefixes to learn the contextualized stance control strategy and to generate the stance control prefix according to the input context. The generated stance prefix is then combined with the toxicity control prefix to guide the response generation. Experimental results show that our proposed method can effectively learn the context-dependent stance control strategies while keeping a low self-toxicity of the underlying LM.  ( 2 min )
    ClimaX: A foundation model for weather and climate. (arXiv:2301.10343v1 [cs.LG])
    Most state-of-the-art approaches for weather and climate modeling are based on physics-informed numerical models of the atmosphere. These approaches aim to model the non-linear dynamics and complex interactions between multiple variables, which are challenging to approximate. Additionally, many such numerical models are computationally intensive, especially when modeling the atmospheric phenomenon at a fine-grained spatial and temporal resolution. Recent data-driven approaches based on machine learning instead aim to directly solve a downstream forecasting or projection task by learning a data-driven functional mapping using deep neural networks. However, these networks are trained using curated and homogeneous climate datasets for specific spatiotemporal tasks, and thus lack the generality of numerical models. We develop and demonstrate ClimaX, a flexible and generalizable deep learning model for weather and climate science that can be trained using heterogeneous datasets spanning different variables, spatio-temporal coverage, and physical groundings. ClimaX extends the Transformer architecture with novel encoding and aggregation blocks that allow effective use of available compute while maintaining general utility. ClimaX is pre-trained with a self-supervised learning objective on climate datasets derived from CMIP6. The pre-trained ClimaX can then be fine-tuned to address a breadth of climate and weather tasks, including those that involve atmospheric variables and spatio-temporal scales unseen during pretraining. Compared to existing data-driven baselines, we show that this generality in ClimaX results in superior performance on benchmarks for weather forecasting and climate projections, even when pretrained at lower resolutions and compute budgets.  ( 2 min )
    Evolve Smoothly, Fit Consistently: Learning Smooth Latent Dynamics For Advection-Dominated Systems. (arXiv:2301.10391v1 [cs.LG])
    We present a data-driven, space-time continuous framework to learn surrogatemodels for complex physical systems described by advection-dominated partialdifferential equations. Those systems have slow-decaying Kolmogorovn-widththat hinders standard methods, including reduced order modeling, from producinghigh-fidelity simulations at low cost. In this work, we construct hypernetwork-based latent dynamical models directly on the parameter space of a compactrepresentation network. We leverage the expressive power of the network and aspecially designed consistency-inducing regularization to obtain latent trajectoriesthat are both low-dimensional and smooth. These properties render our surrogatemodels highly efficient at inference time. We show the efficacy of our frameworkby learning models that generate accurate multi-step rollout predictions at muchfaster inference speed compared to competitors, for several challenging examples.  ( 2 min )
    Learned Interferometric Imaging for the SPIDER Instrument. (arXiv:2301.10260v1 [astro-ph.IM])
    The Segmented Planar Imaging Detector for Electro-Optical Reconnaissance (SPIDER) is an optical interferometric imaging device that aims to offer an alternative to the large space telescope designs of today with reduced size, weight and power consumption. This is achieved through interferometric imaging. State-of-the-art methods for reconstructing images from interferometric measurements adopt proximal optimization techniques, which are computationally expensive and require handcrafted priors. In this work we present two data-driven approaches for reconstructing images from measurements made by the SPIDER instrument. These approaches use deep learning to learn prior information from training data, increasing the reconstruction quality, and significantly reducing the computation time required to recover images by orders of magnitude. Reconstruction time is reduced to ${\sim} 10$ milliseconds, opening up the possibility of real-time imaging with SPIDER for the first time. Furthermore, we show that these methods can also be applied in domains where training data is scarce, such as astronomical imaging, by leveraging transfer learning from domains where plenty of training data are available.  ( 2 min )
  • Open

    Two Efficient Ridge Solutions for the Incremental Broad Learning System on Added Inputs. (arXiv:1911.07292v5 [cs.LG] UPDATED)
    This paper proposes the recursive and square-root BLS algorithms to improve the original BLS for new added inputs, which utilize the inverse and inverse Cholesky factor of the Hermitian matrix in the ridge inverse, respectively, to update the ridge solution. The recursive BLS updates the inverse by the matrix inversion lemma, while the square-root BLS updates the upper-triangular inverse Cholesky factor by multiplying it with an upper-triangular intermediate matrix. When the added p training samples are more than the total k nodes in the network, i.e., p>k, the inverse of a sum of matrices is applied to take a smaller matrix inversion or inverse Cholesky factorization. For the distributed BLS with data-parallelism, we introduce the parallel implementation of the square-root BLS, which is deduced from the parallel implementation of the inverse Cholesky factorization. The original BLS based on the generalized inverse with the ridge regression assumes the ridge parameter lamda->0 in the ridge inverse. When lambda->0 is not satisfied, the numerical experiments on the MNIST and NORB datasets show that both the proposed ridge solutions improve the testing accuracy of the original BLS, and the improvement becomes more significant as lambda is bigger. On the other hand, compared to the original BLS, both the proposed BLS algorithms theoretically require less complexities, and are significantly faster in the simulations on the MNIST dataset. The speedups in total training time of the recursive and square-root BLS algorithms over the original BLS are 4.41 and 6.92 respectively when p > k, and are 2.80 and 1.59 respectively when p < k.
    Certifying Neural Network Robustness to Random Input Noise from Samples. (arXiv:2010.07532v2 [cs.LG] UPDATED)
    Methods to certify the robustness of neural networks in the presence of input uncertainty are vital in safety-critical settings. Most certification methods in the literature are designed for adversarial input uncertainty, but researchers have recently shown a need for methods that consider random uncertainty. In this paper, we propose a novel robustness certification method that upper bounds the probability of misclassification when the input noise follows an arbitrary probability distribution. This bound is cast as a chance-constrained optimization problem, which is then reformulated using input-output samples to replace the optimization constraints. The resulting optimization reduces to a linear program with an analytical solution. Furthermore, we develop a sufficient condition on the number of samples needed to make the misclassification bound hold with overwhelming probability. Our case studies on MNIST classifiers show that this method is able to certify a uniform infinity-norm uncertainty region with a radius of nearly 50 times larger than what the current state-of-the-art method can certify.
    Infinitesimal gradient boosting. (arXiv:2104.13208v2 [stat.ML] UPDATED)
    We define infinitesimal gradient boosting as a limit of the popular tree-based gradient boosting algorithm from machine learning. The limit is considered in the vanishing-learning-rate asymptotic, that is when the learning rate tends to zero and the number of gradient trees is rescaled accordingly. For this purpose, we introduce a new class of randomized regression trees bridging totally randomized trees and Extra Trees and using a softmax distribution for binary splitting. Our main result is the convergence of the associated stochastic algorithm and the characterization of the limiting procedure as the unique solution of a nonlinear ordinary differential equation in a infinite dimensional function space. Infinitesimal gradient boosting defines a smooth path in the space of continuous functions along which the training error decreases, the residuals remain centered and the total variation is well controlled.
    Batch Bayesian Optimization on Permutations using the Acquisition Weighted Kernel. (arXiv:2102.13382v2 [stat.ML] UPDATED)
    In this work we propose a batch Bayesian optimization method for combinatorial problems on permutations, which is well suited for expensive-to-evaluate objectives. We first introduce LAW, an efficient batch acquisition method based on determinantal point processes using the acquisition weighted kernel. Relying on multiple parallel evaluations, LAW enables accelerated search on combinatorial spaces. We then apply the framework to permutation problems, which have so far received little attention in the Bayesian Optimization literature, despite their practical importance. We call this method LAW2ORDER. On the theoretical front, we prove that LAW2ORDER has vanishing simple regret by showing that the batch cumulative regret is sublinear. Empirically, we assess the method on several standard combinatorial problems involving permutations such as quadratic assignment, flowshop scheduling and the traveling salesman, as well as on a structure learning task.
    Data-Driven Certification of Neural Networks with Random Input Noise. (arXiv:2010.01171v2 [cs.LG] UPDATED)
    Methods to certify the robustness of neural networks in the presence of input uncertainty are vital in safety-critical settings. Most certification methods in the literature are designed for adversarial or worst-case inputs, but researchers have recently shown a need for methods that consider random input noise. In this paper, we examine the setting where inputs are subject to random noise coming from an arbitrary probability distribution. We propose a robustness certification method that lower-bounds the probability that network outputs are safe. This bound is cast as a chance-constrained optimization problem, which is then reformulated using input-output samples to make the optimization constraints tractable. We develop sufficient conditions for the resulting optimization to be convex, as well as on the number of samples needed to make the robustness bound hold with overwhelming probability. We show for a special case that the proposed optimization reduces to an intuitive closed-form solution. Case studies on synthetic, MNIST, and CIFAR-10 networks experimentally demonstrate that this method is able to certify robustness against various input noise regimes over larger uncertainty regions than prior state-of-the-art techniques.
    Non-Asymptotic Analysis of a UCB-based Top Two Algorithm. (arXiv:2210.05431v2 [stat.ML] UPDATED)
    A Top Two sampling rule for bandit identification is a method which selects the next arm to sample from among two candidate arms, a leader and a challenger. Due to their simplicity and good empirical performance, they have received increased attention in recent years. However, for fixed-confidence best arm identification, theoretical guarantees for Top Two methods have only been obtained in the asymptotic regime, when the error level vanishes. In this paper, we derive the first non-asymptotic upper bound on the expected sample complexity of a Top Two algorithm, which holds for any error level. Our analysis highlights sufficient properties for a regret minimization algorithm to be used as leader. These properties are satisfied by the UCB algorithm, and our proposed UCB-based Top Two algorithm simultaneously enjoys non-asymptotic guarantees and competitive empirical performance.
    RDIS: Random Drop Imputation with Self-Training for Incomplete Time Series Data. (arXiv:2010.10075v2 [cs.LG] UPDATED)
    Time-series data with missing values are commonly encountered in many fields, such as healthcare, meteorology, and robotics. The imputation aims to fill the missing values with valid values. Most imputation methods trained the models implicitly because missing values have no ground truth. In this paper, we propose Random Drop Imputation with Self-training (RDIS), a novel training method for time-series data imputation models. In RDIS, we generate extra missing values by applying a random drop on the observed values in incomplete data. We can explicitly train the imputation models by filling in the randomly dropped values. In addition, we adopt self-training with pseudo values to exploit the original missing values. To improve the quality of pseudo values, we set the threshold and filter them by calculating the entropy. To verify the effectiveness of RDIS on the time series imputation, we test RDIS to various imputation models and achieve competitive results on two real-world datasets.
    Posterior Covariance Information Criterion for Weighted Inference. (arXiv:2106.13694v4 [stat.ME] UPDATED)
    For predictive evaluation based on quasi-posterior distributions, we develop a new information criterion, the posterior covariance information criterion (PCIC. PCIC generalises the widely applicable information criterion WAIC so as to effectively handle predictive scenarios where likelihoods for the estimation and the evaluation of the model may be different. A typical example of such scenarios is the weighted likelihood inference, including prediction under covariate shift and counterfactual prediction. The proposed criterion utilises a posterior covariance form and is computed by using only one Markov chain Monte Carlo run. Through numerical examples, we demonstrate how PCIC can apply in practice. Further, we show that PCIC is asymptotically unbiased to the quasi-Bayesian generalization error under mild conditions in weighted inference with both regular and singular statistical models.
    Meta-Learning PAC-Bayes Priors in Model Averaging. (arXiv:1912.11252v3 [cs.LG] UPDATED)
    Nowadays model uncertainty has become one of the most important problems in both academia and industry. In this paper, we mainly consider the scenario in which we have a common model set used for model averaging instead of selecting a single final model via a model selection procedure to account for this model's uncertainty to improve the reliability and accuracy of inferences. Here one main challenge is to learn the prior over the model set. To tackle this problem, we propose two data-based algorithms to get proper priors for model averaging. One is for meta-learner, the analysts should use historical similar tasks to extract the information about the prior. The other one is for base-learner, a subsampling method is used to deal with the data step by step. Theoretically, an upper bound of risk for our algorithm is presented to guarantee the performance of the worst situation. In practice, both methods perform well in simulations and real data studies, especially with poor-quality data.
    Truthful Self-Play. (arXiv:2106.03007v4 [stat.ML] UPDATED)
    We present a general optimization framework for emergent belief-state representation without any supervision. We employed the common configuration of multiagent reinforcement learning and communication to improve exploration coverage over an environment by leveraging the knowledge of each agent. In this paper, we obtained that recurrent neural nets (RNNs) with shared weights are highly biased in partially observable environments because of their noncooperativity. To address this, we designated an unbiased version of self-play via mechanism design, also known as reverse game theory, to clarify unbiased knowledge at the Bayesian Nash equilibrium. The key idea is to add imaginary rewards using the peer prediction mechanism, i.e., a mechanism for mutually criticizing information in a decentralized environment. Numerical analyses, including StarCraft exploration tasks with up to 20 agents and off-the-shelf RNNs, demonstrate the state-of-the-art performance.
    On the Semi-supervised Expectation Maximization. (arXiv:2211.00537v2 [cs.LG] UPDATED)
    The Expectation Maximization (EM) algorithm is widely used as an iterative modification to maximum likelihood estimation when the data is incomplete. We focus on a semi-supervised case to learn the model from labeled and unlabeled samples. Existing work in the semi-supervised case has focused mainly on performance rather than convergence guarantee, however we focus on the contribution of the labeled samples to the convergence rate. The analysis clearly demonstrates how the labeled samples improve the convergence rate for the exponential family mixture model. In this case, we assume that the population EM (EM with unlimited data) is initialized within the neighborhood of global convergence for the population EM that consists solely of samples that have not been labeled. The analysis for the labeled samples provides a comprehensive description of the convergence rate for the Gaussian mixture model. In addition, we extend the findings for labeled samples and offer an alternative proof for the population EM's convergence rate with unlabeled samples for the symmetric mixture of two Gaussians.
    Imitating Human Behaviour with Diffusion Models. (arXiv:2301.10677v1 [cs.AI])
    Diffusion models have emerged as powerful generative models in the text-to-image domain. This paper studies their application as observation-to-action models for imitating human behaviour in sequential environments. Human behaviour is stochastic and multimodal, with structured correlations between action dimensions. Meanwhile, standard modelling choices in behaviour cloning are limited in their expressiveness and may introduce bias into the cloned policy. We begin by pointing out the limitations of these choices. We then propose that diffusion models are an excellent fit for imitating human behaviour, since they learn an expressive distribution over the joint action space. We introduce several innovations to make diffusion models suitable for sequential environments; designing suitable architectures, investigating the role of guidance, and developing reliable sampling strategies. Experimentally, diffusion models closely match human demonstrations in a simulated robotic control task and a modern 3D gaming environment.
    Signature Methods in Machine Learning. (arXiv:2206.14674v3 [stat.ML] UPDATED)
    Signature-based techniques give mathematical insight into the interactions between complex streams of evolving data. These insights can be quite naturally translated into numerical approaches to understanding streamed data, and perhaps because of their mathematical precision, have proved useful in analysing streamed data in situations where the data is irregular, and not stationary, and the dimension of the data and the sample sizes are both moderate. Understanding streamed multi-modal data is exponential: a word in $n$ letters from an alphabet of size $d$ can be any one of $d^n$ messages. Signatures remove the exponential amount of noise that arises from sampling irregularity, but an exponential amount of information still remain. This survey aims to stay in the domain where that exponential scaling can be managed directly. Scalability issues are an important challenge in many problems but would require another survey article and further ideas. This survey describes a range of contexts where the data sets are small enough to remove the possibility of massive machine learning, and the existence of small sets of context free and principled features can be used effectively. The mathematical nature of the tools can make their use intimidating to non-mathematicians. The examples presented in this article are intended to bridge this communication gap and provide tractable working examples drawn from the machine learning context. Notebooks are available online for several of these examples. This survey builds on the earlier paper of Ilya Chevryev and Andrey Kormilitzin which had broadly similar aims at an earlier point in the development of this machinery. This article illustrates how the theoretical insights offered by signatures are simply realised in the analysis of application data in a way that is largely agnostic to the data type.
    A Unified and Constructive Framework for the Universality of Neural Networks. (arXiv:2112.14877v3 [cs.LG] UPDATED)
    One of the reasons why many neural networks are capable of replicating complicated tasks or functions is their universal property. Though the past few decades have seen tremendous advances in theories of neural networks, a single constructive framework for neural network universality remains unavailable. This paper is the first effort to provide a unified and constructive framework for the universality of a large class of activation functions including most of existing ones. At the heart of the framework is the concept of neural network approximate identity (nAI). The main result is: {\em any nAI activation function is universal}. It turns out that most of existing activation functions are nAI, and thus universal in the space of continuous functions on compacta. The framework induces {\bf several advantages} over the contemporary counterparts. First, it is constructive with elementary means from functional analysis, probability theory, and numerical analysis. Second, it is the first unified attempt that is valid for most of existing activation functions. Third, as a by product, the framework provides the first universality proof for some of the existing activation functions including Mish, SiLU, ELU, GELU, and etc. Fourth, it provides new proofs for most activation functions. Fifth, it discovers new activation functions with guaranteed universality property. Sixth, for a given activation and error tolerance, the framework provides precisely the architecture of the corresponding one-hidden neural network with predetermined number of neurons, and the values of weights/biases. Seventh, the framework allows us to abstractly present the first universal approximation with favorable non-asymptotic rate.
    Semiparametric discrete data regression with Monte Carlo inference and prediction. (arXiv:2110.12316v5 [stat.ME] UPDATED)
    Discrete data are abundant and often arise as counts or rounded data. These data commonly exhibit complex distributional features such as zero-inflation, over- or under-dispersion, boundedness, and heaping, which render many parametric models inadequate. Yet even for parametric regression models, approximations such as MCMC typically are needed for posterior inference. This paper introduces a Bayesian modeling and algorithmic framework that enables semiparametric regression analysis for discrete data with Monte Carlo (not MCMC) sampling. The proposed approach pairs a nonparametric marginal model with a latent linear regression model to encourage both flexibility and interpretability, and delivers posterior consistency even under model misspecification. For a parametric or large-sample approximation of this model, we identify a class of conjugate priors with (pseudo) closed-form posteriors. All posterior and predictive distributions are available analytically or via Monte Carlo sampling. These tools are broadly useful for linear regression, nonlinear models via basis expansions, and variable selection with discrete data. Simulation studies demonstrate significant advantages in computing, prediction, estimation, and selection relative to existing alternatives. This novel approach is applied to self-reported mental health data that exhibit zero-inflation, overdispersion, boundedness, and heaping.
    Learning Dynamical Systems from Data: A Simple Cross-Validation Perspective, Part V: Sparse Kernel Flows for 132 Chaotic Dynamical Systems. (arXiv:2301.10321v1 [stat.ML])
    Regressing the vector field of a dynamical system from a finite number of observed states is a natural way to learn surrogate models for such systems. A simple and interpretable way to learn a dynamical system from data is to interpolate its vector-field with a data-adapted kernel which can be learned by using Kernel Flows. The method of Kernel Flows is a trainable machine learning method that learns the optimal parameters of a kernel based on the premise that a kernel is good if there is no significant loss in accuracy if half of the data is used. The objective function could be a short-term prediction or some other objective for other variants of Kernel Flows). However, this method is limited by the choice of the base kernel. In this paper, we introduce the method of \emph{Sparse Kernel Flows } in order to learn the ``best'' kernel by starting from a large dictionary of kernels. It is based on sparsifying a kernel that is a linear combination of elemental kernels. We apply this approach to a library of 132 chaotic systems.

  • Open

    Insane face rendering A.I technology..
    submitted by /u/KTMark [link] [comments]  ( 40 min )
    AI Music - Eminem - 'Slim shady is alive'
    submitted by /u/DANGERD0OM [link] [comments]  ( 40 min )
    Finding the right AI for a specific task
    Hi all, We're developing an internal application that groups customers together based on attributes that adhere to a ruleset on how they should be grouped. It does this fine. However, some nuance is then applied via human effort to modify groupings based on some customer notes (a text string) that sometimes dictate that two customers need to be in different groups for x reason, even if the original grouping adheres to the ruleset. The application itself has a UI that sorts customers into columns, which are manipulated by staff via dragging and dropping a customer/customers between one column and another. I had a thought to employ an AI model that compares the original generated grouping config that our code produces against the modified groupings that staff adjust based on that nuance. The idea is that we could analyze the why of a modification and use that insight to generate better default groupings. Is there a model out there that would be ideally suited for this kind of learning? Keen to dive into it further on my own but any recommendations as a starting point would be great. submitted by /u/premiumnougat [link] [comments]  ( 41 min )
    Create Your Chat GPT-3 Web App with Streamlit in Python
    submitted by /u/pasticciociccio [link] [comments]  ( 40 min )
    What do employers and job seekers need to know about artificial intelligence's role in hiring?
    University of Florida - Warrington College of Business's Mo Wang offers advice for the future of work. Full Story: https://explore.research.ufl.edu/the-future-of-work.html#ai-hiring submitted by /u/ufexplore [link] [comments]  ( 40 min )
    BuzzFeed to Use ChatGPT Creator OpenAI to Help Create Quizzes and Other Content
    submitted by /u/trueslicky [link] [comments]  ( 40 min )
    Member of Congress Reads AI-Generated Speech on House Floor
    submitted by /u/dahmedahe [link] [comments]  ( 6 min )
    Synthesizing the Businessmen-Smile
    submitted by /u/walt74 [link] [comments]  ( 45 min )
    AI Dream 150 - ENTERING DREAMWORLD Part2 TEASER - AI Video vqgan clip
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    NIST Risk Management Framework Aims to Improve Trustworthiness of Artificial Intelligence
    submitted by /u/Harley109 [link] [comments]  ( 40 min )
    📌[Searchcolab] It's impressive to see how far generative AI has come in the past 5 years. What should we expect the trajectory of this field to be in the next 5 years? Btw the pictures attached are also generated using AI.
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 41 min )
    "Father" from Equilibrium movie
    Imagine we could create an AGI/ASI that will protect our values. Something like "Father" from Equilibrium (but that's a dystopian version). Let's call it modern God. What values should it protect? submitted by /u/chuguruk [link] [comments]  ( 40 min )
    InstructPix2Pix lets you edit images using only text prompts
    submitted by /u/much_successes [link] [comments]  ( 40 min )
    what are your favorite AI subreddit?
    submitted by /u/Chaserivx [link] [comments]  ( 40 min )
    AI "Upscale" With Only 1000 Training Examples(All examples were dogs)
    submitted by /u/TheRPGGamerMan [link] [comments]  ( 43 min )
    Proud Pollution A movie script Written by AI
    Movie script: Proud Pollution The website used- (https://www.plot-generator.org.uk/) Be free to share your thoughts on it! ​ Proud Pollution A Screenplay by Mr. Pseudonym EXT. VASQUEZ ROCKS, CALIFORNIA - AFTERNOON Misunderstood piolet FLAMOUS JACK THORNTON is arguing with mean scout MISS HELEN FISH. JACK tries to hug HELEN but she shakes him off. JACK Please, Helen, don't leave me. HELEN I'm sorry Jack, but I'm looking for somebody a bit braver. Somebody who faces his fears head on, inhead-onstead of running away. JACK I am such a person! HELEN frowns. HELEN I'm sorry, Jack. I just don't feel excited by this relationship anymore. HELEN leaves. JACK sits down, looking defeated. Moments later, noble navigator MASTER CUTHBERT MACDONALD barges in looking flustered. JACK Go…  ( 47 min )
    Meta's chief AI scientist says "ChatGPT is not innovative".
    What happened? So, there has been a lot of excitement around OpenAI's ChatGPT which generates natural-language responses to human prompts. But what if... it's not as amazing as we all think it is? Yann LeCun, Meta's chief AI scientist, argues that the program is not innovative. He also states that similar technology has been developed by many companies and research labs, and that ChatGPT is composed of multiple pieces of technology developed over many years by many parties. (sounds salty to me... and I like my cookies sweet!) But maybe Yann has a point! What's happening now? ChatGPT is perceived by many as a unique and innovative program. People are using it everyday to make their lives easier. So, no matter what, ChatGPT is still awesome. What's happening next? It's unclear what will happen next in terms of the development and perception of ChatGPT. However, it can be expected that as AI technology continues to evolve, there will be further advancements on what is possible with this tech. It's likely that ChatGPT will face many spin-offs and competitors this year. If you enjoyed this and want 500+ AI tools, I write a daily AI newsletter: https://chriscookies.beehiiv.com/p/metas-chief-ai-scientist-says-chatgpt-not-innovative-7581 submitted by /u/ZaKodiak [link] [comments]  ( 41 min )
    Looking for some ideas to research about making money using AI!
    Hey yall, I'm looking for ideas of how to make money using language models like ChatGPT. I want to go in depth and research a bit as well as begin designing tutorials based on my experiences of what's the best way to make money. I am open to any suggestions or things that have worked for you all! Thanks! submitted by /u/Chadcash [link] [comments]  ( 41 min )
    What is Atomic AI? - Is AI going to have a Drug development breakthrough soon?
    submitted by /u/BackgroundResult [link] [comments]  ( 40 min )
    AI Video to Fill Missing Frames/Smooth Animation?
    Hey all, Was wondering if you know some kind of AI tool that exists to fill missing frames and therefore smooth animation in animated videos? Trying to get something cleaned up and really nailed in. Thanks! submitted by /u/miseryleech [link] [comments]  ( 40 min )
    ChatGPT: OpenAI’s Last Resort Turns Out To Be A Winner
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    41 AI Written Articles Out Of 77 On CNET Have Plagiarism And Errors
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 6 min )
    Chrome Extension that uses AI to write emails.
    submitted by /u/bobsandalex [link] [comments]  ( 40 min )
    I told an AI to freak out on camera. it was ALL made by AI.
    submitted by /u/25dopren [link] [comments]  ( 40 min )
  • Open

    [D] score based vs. Diffusion models
    I know there is a mathematical way to show that the two approaches of score matching models and diffusion models are the same. I wonder, if there in practice/code are the same either? I already tried to find some PyTorch implementations of score based models but didn’t find anything yet - just for diffusion models. submitted by /u/Individual-Cause-616 [link] [comments]  ( 43 min )
    [D] Why are GANs worse than (Latent) Diffusion Models for text2img generation?
    I guess what I'm trying to figure out is, what are the main reasons that DMs are outperforming GANs in text2img generation? Thanks! submitted by /u/TheCockatoo [link] [comments]  ( 46 min )
    [P] A python module to generate optimized prompts & solve different NLP problems using GPT-n based models and return structured python object for easy parsing
    Hi folks, I was working on a personal experimental project related to GPT-3, which I thought of making it open source now. It saves much time while working with LLMs. If you are an industrial researcher or application developer, you probably have worked with GPT-3 apis. A common challenge when utilizing LLMs such as GPT-3 and BLOOM is their tendency to produce uncontrollable & unstructured outputs, making it difficult to use them for various NLP tasks and applications.To address this, we developed Promptify, a library that allows for the use of LLMs to solve NLP problems including Named Entity Recognition, Binary Classification, Multi-Label Classification, and Question-Answering and return a python object for easy parsing to construct additional applications on top of GPT-n based models. Features 🚀 🧙‍♀️ NLP Tasks (NER, Binary Text Classification, Multi-Label Classification etc) in 2 lines of code with no training data required 🔨 Easily add one shot, two shot, or few shot examples to the prompt ✌ Output always provided as a Python object (e.g. list, dictionary) for easy parsing and filtering 💥 Custom examples and samples can be easily added to the prompt 💰 Optimized prompts to reduce OpenAI token costs GITHUB: https://github.com/promptslab/Promptify Examples: https://github.com/promptslab/Promptify/tree/main/examples For quick demo -> Colab I hope it will be helpful in your research. Thanks :) NER example ​ https://preview.redd.it/vnz4mf0i6gea1.png?width=1398&format=png&auto=webp&s=74c70bd9d518423f913c1fb9c68cf2565cf8cffc submitted by /u/aadityaura [link] [comments]  ( 43 min )
    A Watermark for Large Language Models
    submitted by /u/lookinsidemybutthole [link] [comments]  ( 42 min )
    [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers
    Dec 2022 paper from Microsoft research: https://arxiv.org/abs/2212.10559v2 Large pretrained language models have shown surprising In-Context Learning (ICL) ability. With a few demonstration input-label pairs, they can predict the label for an unseen input without additional parameter updates. Despite the great success in performance, the working mechanism of ICL still remains an open problem. In order to better understand how ICL works, this paper explains language models as meta-optimizers and understands ICL as a kind of implicit finetuning. submitted by /u/currentscurrents [link] [comments]  ( 43 min )
    [Discussion] Github like alternative for ML?
    Versioning and collaboration on code for software engineers is a reasonably solved problem through GitHub since the task at hand predominantly involves just maintaining different copies of just simple vanilla code in different folders. On the other hand, ML engineers face the humungous task of maintaining different versions on not just code, but hyper parameters, data, models, data lineage and labels and storing this on GitHub currently does not allow you to track the changes on each variable well. What are the software/open source tools currently used for the same? Is their a space for a new company to be built here? submitted by /u/angkhandelwal749 [link] [comments]  ( 44 min )
    Are there any projects working at an open source version of Constitutional AI? [D]
    I'm looking into projects which augment the RLHF training approach of chatGPT with explicit rules, such as in https://paperswithcode.com/paper/constitutional-ai-harmlessness-from-ai. Ideally there would be both rules and priority levels between the rules, similarly to the Asimov laws of robotics. The Open-Assistant project (https://github.com/LAION-AI/Open-Assistant) captures the spirit, but it is looking to replicate chatGPT at the moment. submitted by /u/lorepieri [link] [comments]  ( 42 min )
    [D] Quantitative measure for smoothness of NLP autoencoder latent space
    I would like to measure the smoothness of an NLP-autoencoder's latent space. The idea is to sample two Gaussian vectors v1 and v2 in the latent space of the AE, and generate N-1 points between them like so: vi = v1 + (v2 - v1) / (N * i) My idea is to then decode these vectors and measure the BLEU score between d(vi) and d(vi+1) for all N-2 comparisons. Is this idea reasonable, do you have a better one? Is there a technique from AEs with images that can be useful here? submitted by /u/Blutorangensaft [link] [comments]  ( 43 min )
    [D] What are some of your favorite ML research posters?
    And what are your own best practices when creating one (e.g. adding a QR code that links to the GitHub project or paper PDF)? submitted by /u/epistoteles [link] [comments]  ( 42 min )
    [D] Fastest and most accurate model for casing
    What is the state of the art regarding freely available casing models, i.e. DNNs, that try to restore the original casing of a text with uniform (either lowercase or capital letters) casing? I value both speed and accuracy, as I have to process a large corpus of text. submitted by /u/Blutorangensaft [link] [comments]  ( 43 min )
    Few questions about scalability of chatGPT [D]
    I have two questions about chatGPT. I don't come from a machine learning background. I am just a programmer. So bear with me if they sound a bit dumb. I was checking about chatGPT a bit the last week. I went through their papers and also tried out a fine tuning by myself by creating some fictional world and giving it some examples. The first thing I wondered is what is very special about the model than the large data and parameter set it has, that other competitors can't do. I ask this because I have seen a lot of "google killer" discussions in some places. From what I understood from their papers I thought it is something another company with the computing power and the filtered data can have up and running in few months. I see their advantage in rolling out to the public because with feedbacks from actual users all over the world it can potentially be retrained. The second thing I wondered is its scalability. It feels to me that it is a very big challenge to keep it scalable in the future. Currently getting a long text out of it is kind of painful because it has to continuously generate. I think it is continuously calculating with the huge parameter set it has. I wonder also about new trends, if it needs to be retrained. I also used it for a fine tuning, where I created a fictional world with its own law and rules and the fine tuning took hours in the queue - so is it creating separate parameters for my case? that would be a lot considering how much parameter set they have. submitted by /u/besabestin [link] [comments]  ( 50 min )
    [P] EvoTorch 0.4.0 dropped with GPU-accelerated implementations of CMA-ES, MAP-Elites and NSGA-II.
    Find the release notes here: https://github.com/nnaisense/evotorch/releases/tag/v0.4.0 A big highlight is how fast these implementations are! I genuinely believe GPU-acceleration is the future of Evolutionary algorithms, and EvoTorch and its integration into the PyTorch ecosystem is a fantastic enabler for this. To demonstrate the raw speed provided by the new release, I compared EvoTorch's CMA-ES implementation to that provided by the popular pycma package on the 80-dimensional Rastrigin problem and tracked the run-time: Performance was measured over 50 runs on the 80-dimensional Rastrigin problem The crazy thing to note is that when we switch to GPU (Tesla V100), we can efficiently run CMA-ES with population sizes going into 100k+! submitted by /u/NaturalGradient [link] [comments]  ( 45 min )
    Machine learning and black box numerical solver[D]
    Anybody know some methods and techniques for integrating a numerical solver with the neural network .. how do you calculate the gradients of the solver when you don’t know the details of such solver- black box solver. submitted by /u/Due-Wall-915 [link] [comments]  ( 43 min )
    [P] Diffusion models best practices
    I'm about to start an experimental project that involves training a denoising diffusion model on the medical data (small dataset). Could you please share useful resources, tips, tricks and heuristics for dealing with diffusion models? submitted by /u/debrises [link] [comments]  ( 42 min )
  • Open

    "Cheaters Hacked an AI Bot—and Beat the 'Rocket League' Elite"
    submitted by /u/gwern [link] [comments]  ( 40 min )
    Insights and learnings
    Hey all, I am part of an incubator and interested in building developer tooling for reinforcement learning. I would love to understand, from the RL community, what some of the biggest pinpoints are in developing and productionising RL agents. Would love to hear about your implementations too, if you are happy to share! submitted by /u/paramkumar1992 [link] [comments]  ( 41 min )
    Are there papers that do an empirical investigation on DRL hyperparameters?
    Could someone please help with this - https://ai.stackexchange.com/questions/38894/are-there-papers-that-do-an-empirical-investigation-on-drl-hyperparameters submitted by /u/Academic-Rent7800 [link] [comments]  ( 41 min )
    DQN application
    I want to train a DQN model in an off-policy fashion, where my behavior policy is an older agent. I have a big memory of a lot of episodes of this agent. Now I want to find a better policy using DQN. Now I am just wondering, in the "normal" DQN case you would use the experience replay buffer and would update behavior and target policy online (behavior not really online but with the time lag introduced after which these parameters are also updated). In my case, I already have all the experience and would like to learn from it. Do you think it makes sense to use the exact same procedure in this context, so sampling one new action, state and immediate reward, follow up action or could it be better here to use the fact that all experience is already stored to exchange the immediate reward + gamma*maxQ(s',a') with some more future information about the rewards (up to the point of Monte Carlo where you take G_t so the discounted cumulative reward seen during the episode from point t onwards)? submitted by /u/PatrickSVM [link] [comments]  ( 42 min )
    "Imitating Human Behaviour with Diffusion Models", Pearce et al 2023 {MS}
    submitted by /u/gwern [link] [comments]  ( 40 min )
    "Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning", Wang et al 2022 {Twitter}
    submitted by /u/gwern [link] [comments]  ( 40 min )
    "Learning with Queried Hints" [on "Online Learning and Bandits with Queried Hints", Bhaskara et al 2022 {G}]
    submitted by /u/gwern [link] [comments]  ( 40 min )
  • Open

    Why can't my neural network recognize my own digits, but it has 97% accuracy on mnist test samples?
    I've been learning about neural networks for the past few days and I decided to write my own in Python. To keep it simple, I didn't write code for convolution layers and such, only fully connected layers with logistic activation function. I first trained it to do XOR with a 2 -> 2 -> 1 layers layout and it worked. Then I tried to train a 28*28 -> 100 -> 10 network on the MNIST digit dataset to recognize digits. When running it on the test samples, the accuracy was 97%, but when running on my own samples it barely ever manages to get it right. Does anyone have any idea on why this would happen? submitted by /u/FidgetSpinzz [link] [comments]  ( 42 min )
    Why add bias instead of subtracting bias?
    Pretty much the title, why do we add the bias instead of subtracting? Also when i watched 3blue1browns video about neural networks nad he said that you subtract the with the bias, but other sources tell me or explain that you simply add the bias in the dot product instead of subtracting. //Newbie submitted by /u/ArthurLCTTheCool [link] [comments]  ( 41 min )
    Machine Learning Framework with Neural Networks for Java
    Hey guys, a buddy of mine and me created a Framework for Machine Learning in Java. It provides the possibilities to train the neural Networks via backpropagation but we also implemented a Genetic Algorithm which can also be used for the training. The project is intended to be used by people who only learn Java in school and want to try out ML without the need of learning Python or complex Java Libraries. It's designed to be easy to use and to be played around with. Qualifications needed to use: basic Java understanding A brief understanding of what Machine Learning is Here is a step-by-step tutorial how to predict diabetes with this framework: https://easy-ml.gitbook.io/easy-ml-for-java/fundamentals/implement-your-first-ai (This example is using the genetic Algorithm, there is already one example in the source code published using the Backpropagation approach but the tutorial for it is gonna follow in the next few days) Please also look at the GitHub repository and leave some feedback about code and design. (Especially considering the ReadMe) https://github.com/tomLamprecht/Easy-ML-For-Java https://easy-ml.gitbook.io/easy-ml-for-java/ (Doc) Right now, I'm working on adding Convolutional Neural Networks as well. Feel free to also check open issues on our GitHub if you want to contribute! :) Thanks so much, and also especially for those who contributed to our project with pull requests. PS: we earn no cent with this project, and we just do it for the experience. So feedback is basically our payment :D (and ofc stars on GitHub hehe) submitted by /u/Lampard557 [link] [comments]  ( 42 min )
  • Open

    Best Egg achieved three times faster ML model training with Amazon SageMaker Automatic Model Tuning
    This post is co-authored by Tristan Miller from Best Egg. Best Egg is a leading financial confidence platform that provides lending products and resources focused on helping people feel more confident as they manage their everyday finances. Since March 2014, Best Egg has delivered $22 billion in consumer personal loans with strong credit performance, welcomed […]  ( 8 min )
  • Open

    What Are Large Language Models Used For?
    AI applications are summarizing articles, writing stories and engaging in long conversations — and large language models are doing the heavy lifting. A large language model, or LLM, is a deep learning algorithm that can recognize, summarize, translate, predict and generate text and other content based on knowledge gained from massive datasets. Large language models Read article >  ( 7 min )
    DLSS 3 Delivers Ultimate Boost in Latest Game Updates on GeForce NOW
    GeForce NOW RTX 4080 SuperPODs are rolling out now, bringing RTX 4080-class performance and features to Ultimate members — including support for NVIDIA Ada Lovelace GPU architecture technologies like NVIDIA DLSS 3.  This GFN Thursday brings updates to some of GeForce NOW’s hottest games that take advantage of these amazing technologies, all from the cloud. Read article >  ( 6 min )
  • Open

    Remove algorithmic filters from what you read
    I typically announce new blog posts from my most relevant twitter account: data science from @DataSciFact, algebra and miscellaneous math from @AlgebraFact, TeX and typography from @TeXtip, etc. If you’d like to be sure that you’re notified of each post, regardless of what algorithms Twitter applies to your feed, you can subscribe to this blog […] Remove algorithmic filters from what you read first appeared on John D. Cook.  ( 5 min )
  • Open

    Robustness through Data Augmentation Loss Consistency. (arXiv:2110.11205v3 [cs.LG] UPDATED)
    While deep learning through empirical risk minimization (ERM) has succeeded at achieving human-level performance at a variety of complex tasks, ERM is not robust to distribution shifts or adversarial attacks. Synthetic data augmentation followed by empirical risk minimization (DA-ERM) is a simple and widely used solution to improve robustness in ERM. In addition, consistency regularization can be applied to further improve the robustness of the model by forcing the representation of the original sample and the augmented one to be similar. However, existing consistency regularization methods are not applicable to covariant data augmentation, where the label in the augmented sample is dependent on the augmentation function. For example, dialog state covaries with named entity when we augment data with a new named entity. In this paper, we propose data augmented loss invariant regularization (DAIR), a simple form of consistency regularization that is applied directly at the loss level rather than intermediate features, making it widely applicable to both invariant and covariant data augmentation regardless of network architecture, problem setup, and task. We apply DAIR to real-world learning problems involving covariant data augmentation: robust neural task-oriented dialog state tracking and robust visual question answering. We also apply DAIR to tasks involving invariant data augmentation: robust regression, robust classification against adversarial attacks, and robust ImageNet classification under distribution shift. Our experiments show that DAIR consistently outperforms ERM and DA-ERM with little marginal computational cost and sets new state-of-the-art results in several benchmarks involving covariant data augmentation. Our code of all experiments is available at: https://github.com/optimization-for-data-driven-science/DAIR.git  ( 3 min )
    Neuronal architecture extracts statistical temporal patterns. (arXiv:2301.10203v1 [q-bio.NC])
    Neuronal systems need to process temporal signals. We here show how higher-order temporal (co-)fluctuations can be employed to represent and process information. Concretely, we demonstrate that a simple biologically inspired feedforward neuronal model is able to extract information from up to the third order cumulant to perform time series classification. This model relies on a weighted linear summation of synaptic inputs followed by a nonlinear gain function. Training both - the synaptic weights and the nonlinear gain function - exposes how the non-linearity allows for the transfer of higher order correlations to the mean, which in turn enables the synergistic use of information encoded in multiple cumulants to maximize the classification accuracy. The approach is demonstrated both on a synthetic and on real world datasets of multivariate time series. Moreover, we show that the biologically inspired architecture makes better use of the number of trainable parameters as compared to a classical machine-learning scheme. Our findings emphasize the benefit of biological neuronal architectures, paired with dedicated learning algorithms, for the processing of information embedded in higher-order statistical cumulants of temporal (co-)fluctuations.  ( 2 min )
    Gradient-adjusted Incremental Target Propagation Provides Effective Credit Assignment in Deep Neural Networks. (arXiv:2102.11598v3 [cs.LG] UPDATED)
    Many of the recent advances in the field of artificial intelligence have been fueled by the highly successful backpropagation of error (BP) algorithm, which efficiently solves the credit assignment problem in artificial neural networks. However, it is unlikely that BP is implemented in its usual form within biological neural networks, because of its reliance on non-local information in propagating error gradients. Since biological neural networks are capable of highly efficient learning and responses from BP trained models can be related to neural responses, it seems reasonable that a biologically viable approximation of BP underlies synaptic plasticity in the brain. Gradient-adjusted incremental target propagation (GAIT-prop or GP for short) has recently been derived directly from BP and has been shown to successfully train networks in a more biologically plausible manner. However, so far, GP has only been shown to work on relatively low-dimensional problems, such as handwritten-digit recognition. This work addresses some of the scaling issues in GP and shows it to perform effective multi-layer credit assignment in deeper networks and on the much more challenging ImageNet dataset.  ( 2 min )
    Denoising Diffusion Probabilistic Models for Generation of Realistic Fully-Annotated Microscopy Image Data Sets. (arXiv:2301.10227v1 [eess.IV])
    Denoising diffusion probabilistic models have shown great potential in generating realistic image data. We show how those models can be used to generate realistic microscopy image data in 2D and 3D based on simulated sketches of cellular structures. Multiple data sets are used as an inspiration to simulate sketches of different cellular structures, allowing to generate fully-annotated image data sets without requiring human interactions. Those data sets are used to train segmentation approaches and demonstrate that annotation-free segmentation of cellular structures in fluorescence microscopy image data can be achieved, thereby leaping towards the ultimate goal of eliminating the necessity of human annotation efforts.  ( 2 min )
    CovidRhythm: A Deep Learning Model for Passive Prediction of Covid-19 using Biobehavioral Rhythms Derived from Wearable Physiological Data. (arXiv:2301.10168v1 [eess.SP])
    To investigate whether a deep learning model can detect Covid-19 from disruptions in the human body's physiological (heart rate) and rest-activity rhythms (rhythmic dysregulation) caused by the SARS-CoV-2 virus. We propose CovidRhythm, a novel Gated Recurrent Unit (GRU) Network with Multi-Head Self-Attention (MHSA) that combines sensor and rhythmic features extracted from heart rate and activity (steps) data gathered passively using consumer-grade smart wearable to predict Covid-19. A total of 39 features were extracted (standard deviation, mean, min/max/avg length of sedentary and active bouts) from wearable sensor data. Biobehavioral rhythms were modeled using nine parameters (mesor, amplitude, acrophase, and intra-daily variability). These features were then input to CovidRhythm for predicting Covid-19 in the incubation phase (one day before biological symptoms manifest). A combination of sensor and biobehavioral rhythm features achieved the highest AUC-ROC of 0.79 [Sensitivity = 0.69, Specificity=0.89, F$_{0.1}$ = 0.76], outperforming prior approaches in discriminating Covid-positive patients from healthy controls using 24 hours of historical wearable physiological. Rhythmic features were the most predictive of Covid-19 infection when utilized either alone or in conjunction with sensor features. Sensor features predicted healthy subjects best. Circadian rest-activity rhythms that combine 24h activity and sleep information were the most disrupted. CovidRhythm demonstrates that biobehavioral rhythms derived from consumer-grade wearable data can facilitate timely Covid-19 detection. To the best of our knowledge, our work is the first to detect Covid-19 using deep learning and biobehavioral rhythms features derived from consumer-grade wearable data.  ( 2 min )
    Mesostructures: Beyond Spectrogram Loss in Differentiable Time-Frequency Analysis. (arXiv:2301.10183v1 [cs.SD])
    Computer musicians refer to mesostructures as the intermediate levels of articulation between the microstructure of waveshapes and the macrostructure of musical forms. Examples of mesostructures include melody, arpeggios, syncopation, polyphonic grouping, and textural contrast. Despite their central role in musical expression, they have received limited attention in deep learning. Currently, autoencoders and neural audio synthesizers are only trained and evaluated at the scale of microstructure: i.e., local amplitude variations up to 100 milliseconds or so. In this paper, we formulate and address the problem of mesostructural audio modeling via a composition of a differentiable arpeggiator and time-frequency scattering. We empirically demonstrate that time--frequency scattering serves as a differentiable model of similarity between synthesis parameters that govern mesostructure. By exposing the sensitivity of short-time spectral distances to time alignment, we motivate the need for a time-invariant and multiscale differentiable time--frequency model of similarity at the level of both local spectra and spectrotemporal modulations.  ( 2 min )
    On the Tradeoff between Energy, Precision, and Accuracy in Federated Quantized Neural Networks. (arXiv:2111.07911v3 [cs.LG] UPDATED)
    Deploying federated learning (FL) over wireless networks with resource-constrained devices requires balancing between accuracy, energy efficiency, and precision. Prior art on FL often requires devices to train deep neural networks (DNNs) using a 32-bit precision level for data representation to improve accuracy. However, such algorithms are impractical for resource-constrained devices since DNNs could require execution of millions of operations. Thus, training DNNs with a high precision level incurs a high energy cost for FL. In this paper, a quantized FL framework, that represents data with a finite level of precision in both local training and uplink transmission, is proposed. Here, the finite level of precision is captured through the use of quantized neural networks (QNNs) that quantize weights and activations in fixed-precision format. In the considered FL model, each device trains its QNN and transmits a quantized training result to the base station. Energy models for the local training and the transmission with the quantization are rigorously derived. An energy minimization problem is formulated with respect to the level of precision while ensuring convergence. To solve the problem, we first analytically derive the FL convergence rate and use a line search method. Simulation results show that our FL framework can reduce energy consumption by up to 53% compared to a standard FL model. The results also shed light on the tradeoff between precision, energy, and accuracy in FL over wireless networks.  ( 2 min )
    Neyman-Pearson Multi-class Classification via Cost-sensitive Learning. (arXiv:2111.04597v2 [stat.ML] UPDATED)
    Most existing classification methods aim to minimize the overall misclassification error rate. However, in applications, different types of errors can have different consequences. Two popular paradigms have been developed to account for this asymmetry issue: the Neyman-Pearson (NP) paradigm and the cost-sensitive (CS) paradigm. Compared to the CS paradigm, the NP paradigm does not require a specification of costs. Most previous works on the NP paradigm focused on the binary case. In this work, we study the multi-class NP problem by connecting it to the CS problem and propose two algorithms. We extend the NP oracle inequalities and consistency from the binary case to the multi-class case, showing that our two algorithms enjoy these properties under certain conditions. The simulation and real data studies demonstrate the effectiveness of our algorithms. To our knowledge, this is the first work to solve the multi-class NP problem via cost-sensitive learning techniques with theoretical guarantees. The proposed algorithms are implemented in the R package npcs on CRAN.  ( 2 min )
    Read the Signs: Towards Invariance to Gradient Descent's Hyperparameter Initialization. (arXiv:2301.10133v1 [cs.LG])
    We propose ActiveLR, an optimization meta algorithm that localizes the learning rate, $\alpha$, and adapts them at each epoch according to whether the gradient at each epoch changes sign or not. This sign-conscious algorithm is aware of whether from the previous step to the current one the update of each parameter has been too large or too small and adjusts the $\alpha$ accordingly. We implement the Active version (ours) of widely used and recently published gradient descent optimizers, namely SGD with momentum, AdamW, RAdam, and AdaBelief. Our experiments on ImageNet, CIFAR-10, WikiText-103, WikiText-2, and PASCAL VOC using different model architectures, such as ResNet and Transformers, show an increase in generalizability and training set fit, and decrease in training time for the Active variants of the tested optimizers. The results also show robustness of the Active variant of these optimizers to different values of the initial learning rate. Furthermore, the detrimental effects of using large mini-batch sizes are mitigated. ActiveLR, thus, alleviates the need for hyper-parameter search for two of the most commonly tuned hyper-parameters that require heavy time and computational costs to pick. We encourage AI researchers and practitioners to use the Active variant of their optimizer of choice for faster training, better generalizability, and reducing carbon footprint of training deep neural networks.  ( 2 min )
    Proportional Fairness in Federated Learning. (arXiv:2202.01666v3 [cs.LG] UPDATED)
    With the increasingly broad deployment of federated learning (FL) systems in the real world, it is critical but challenging to ensure fairness in FL, i.e. reasonably satisfactory performances for each of the numerous diverse clients. In this work, we introduce and study a new fairness notion in FL, called proportional fairness (PF), which is based on the relative change of each client's performance. From its connection with the bargaining games, we propose PropFair, a novel and easy-to-implement algorithm for finding proportionally fair solutions in FL and study its convergence properties. Through extensive experiments on vision and language datasets, we demonstrate that PropFair can approximately find PF solutions, and it achieves a good balance between the average performances of all clients and of the worst 10% clients.  ( 2 min )
    Analysis of Arrhythmia Classification on ECG Dataset. (arXiv:2301.10174v1 [cs.LG])
    The heart is one of the most vital organs in the human body. It supplies blood and nutrients in other parts of the body. Therefore, maintaining a healthy heart is essential. As a heart disorder, arrhythmia is a condition in which the heart's pumping mechanism becomes aberrant. The Electrocardiogram is used to analyze the arrhythmia problem from the ECG signals because of its fewer difficulties and cheapness. The heart peaks shown in the ECG graph are used to detect heart diseases, and the R peak is used to analyze arrhythmia disease. Arrhythmia is grouped into two groups - Tachycardia and Bradycardia for detection. In this paper, we discussed many different techniques such as Deep CNNs, LSTM, SVM, NN classifier, Wavelet, TQWT, etc., that have been used for detecting arrhythmia using various datasets throughout the previous decade. This work shows the analysis of some arrhythmia classification on the ECG dataset. Here, Data preprocessing, feature extraction, classification processes were applied on most research work and achieved better performance for classifying ECG signals to detect arrhythmia. Automatic arrhythmia detection can help cardiologists make the right decisions immediately to save human life. In addition, this research presents various previous research limitations with some challenges in detecting arrhythmia that will help in future research.  ( 2 min )
    Sleep Activity Recognition and Characterization from Multi-Source Passively Sensed Data. (arXiv:2301.10156v1 [eess.SP])
    Sleep constitutes a key indicator of human health, performance, and quality of life. Sleep deprivation has long been related to the onset, development, and worsening of several mental and metabolic disorders, constituting an essential marker for preventing, evaluating, and treating different health conditions. Sleep Activity Recognition methods can provide indicators to assess, monitor, and characterize subjects' sleep-wake cycles and detect behavioral changes. In this work, we propose a general method that continuously operates on passively sensed data from smartphones to characterize sleep and identify significant sleep episodes. Thanks to their ubiquity, these devices constitute an excellent alternative data source to profile subjects' biorhythms in a continuous, objective, and non-invasive manner, in contrast to traditional sleep assessment methods that usually rely on intrusive and subjective procedures. A Heterogeneous Hidden Markov Model is used to model a discrete latent variable process associated with the Sleep Activity Recognition task in a self-supervised way. We validate our results against sleep metrics reported by tested wearables, proving the effectiveness of the proposed approach and advocating its use to assess sleep without more reliable sources.  ( 2 min )
    EEG Opto-processor: epileptic seizure detection using diffractive photonic computing units. (arXiv:2301.10167v1 [eess.SP])
    Electroencephalography (EEG) analysis extracts critical information from brain signals, which has provided fundamental support for various applications, including brain-disease diagnosis and brain-computer interface. However, the real-time processing of large-scale EEG signals at high energy efficiency has placed great challenges for electronic processors on edge computing devices. Here, we propose the EEG opto-processor based on diffractive photonic computing units (DPUs) to effectively process the extracranial and intracranial EEG signals and perform epileptic seizure detection. The signals of EEG channels within a second-time window are optically encoded as inputs to the constructed diffractive neural networks for classification, which monitors the brain state to determine whether it's the symptom of an epileptic seizure or not. We developed both the free-space and integrated DPUs as edge computing systems and demonstrated their applications for real-time epileptic seizure detection with the benchmark datasets, i.e., the CHB-MIT extracranial EEG dataset and Epilepsy-iEEG-Multicenter intracranial EEG dataset, at high computing performance. Along with the channel selection mechanism, both the numerical evaluations and experimental results validated the sufficient high classification accuracies of the proposed opto-processors for supervising the clinical diagnosis. Our work opens up a new research direction of utilizing photonic computing techniques for processing large-scale EEG signals in promoting its broader applications.  ( 2 min )
    Federated Learning Meets Multi-objective Optimization. (arXiv:2006.11489v2 [cs.LG] UPDATED)
    Federated learning has emerged as a promising, massively distributed way to train a joint deep model over large amounts of edge devices while keeping private user data strictly on device. In this work, motivated from ensuring fairness among users and robustness against malicious adversaries, we formulate federated learning as multi-objective optimization and propose a new algorithm FedMGDA+ that is guaranteed to converge to Pareto stationary solutions. FedMGDA+ is simple to implement, has fewer hyperparameters to tune, and refrains from sacrificing the performance of any participating user. We establish the convergence properties of FedMGDA+ and point out its connections to existing approaches. Extensive experiments on a variety of datasets confirm that FedMGDA+ compares favorably against state-of-the-art.  ( 2 min )
    VaiPhy: a Variational Inference Based Algorithm for Phylogeny. (arXiv:2203.01121v3 [q-bio.PE] UPDATED)
    Phylogenetics is a classical methodology in computational biology that today has become highly relevant for medical investigation of single-cell data, e.g., in the context of cancer development. The exponential size of the tree space is, unfortunately, a substantial obstacle for Bayesian phylogenetic inference using Markov chain Monte Carlo based methods since these rely on local operations. And although more recent variational inference (VI) based methods offer speed improvements, they rely on expensive auto-differentiation operations for learning the variational parameters. We propose VaiPhy, a remarkably fast VI based algorithm for approximate posterior inference in an augmented tree space. VaiPhy produces marginal log-likelihood estimates on par with the state-of-the-art methods on real data and is considerably faster since it does not require auto-differentiation. Instead, VaiPhy combines coordinate ascent update equations with two novel sampling schemes: (i) SLANTIS, a proposal distribution for tree topologies in the augmented tree space, and (ii) the JC sampler, to the best of our knowledge, the first-ever scheme for sampling branch lengths directly from the popular Jukes-Cantor model. We compare VaiPhy in terms of density estimation and runtime. Additionally, we evaluate the reproducibility of the baselines. We provide our code on GitHub: \url{https://github.com/Lagergren-Lab/VaiPhy}.  ( 2 min )
    Lowering Detection in Sport Climbing Based on Orientation of the Sensor Enhanced Quickdraw. (arXiv:2301.10164v1 [eess.SP])
    Tracking climbers' activity to improve services and make the best use of their infrastructure is a concern for climbing gyms. Each climbing session must be analyzed from beginning till lowering of the climber. Therefore, spotting the climbers descending is crucial since it indicates when the ascent has come to an end. This problem must be addressed while preserving privacy and convenience of the climbers and the costs of the gyms. To this aim, a hardware prototype is developed to collect data using accelerometer sensors attached to a piece of climbing equipment mounted on the wall, called quickdraw, that connects the climbing rope to the bolt anchors. The corresponding sensors are configured to be energy-efficient, hence become practical in terms of expenses and time consumption for replacement when using in large quantity in a climbing gym. This paper describes hardware specifications, studies data measured by the sensors in ultra-low power mode, detect sensors' orientation patterns during lowering different routes, and develop an supervised approach to identify lowering.  ( 2 min )
    Sequential Graph Attention Learning for Predicting Dynamic Stock Trends (Student Abstract). (arXiv:2301.10153v1 [q-fin.ST])
    The stock market is characterized by a complex relationship between companies and the market. This study combines a sequential graph structure with attention mechanisms to learn global and local information within temporal time. Specifically, our proposed "GAT-AGNN" module compares model performance across multiple industries as well as within single industries. The results show that the proposed framework outperforms the state-of-the-art methods in predicting stock trends across multiple industries on Taiwan Stock datasets.  ( 2 min )
    How Jellyfish Characterise Alternating Group Equivariant Neural Networks. (arXiv:2301.10152v1 [cs.LG])
    We provide a full characterisation of all of the possible alternating group ($A_n$) equivariant neural networks whose layers are some tensor power of $\mathbb{R}^{n}$. In particular, we find a basis of matrices for the learnable, linear, $A_n$-equivariant layer functions between such tensor power spaces in the standard basis of $\mathbb{R}^{n}$. We also describe how our approach generalises to the construction of neural networks that are equivariant to local symmetries.  ( 2 min )
    Computational Solar Energy -- Ensemble Learning Methods for Prediction of Solar Power Generation based on Meteorological Parameters in Eastern India. (arXiv:2301.10159v1 [cs.LG])
    The challenges in applications of solar energy lies in its intermittency and dependency on meteorological parameters such as; solar radiation, ambient temperature, rainfall, wind-speed etc., and many other physical parameters like dust accumulation etc. Hence, it is important to estimate the amount of solar photovoltaic (PV) power generation for a specific geographical location. Machine learning (ML) models have gained importance and are widely used for prediction of solar power plant performance. In this paper, the impact of weather parameters on solar PV power generation is estimated by several Ensemble ML (EML) models like Bagging, Boosting, Stacking, and Voting for the first time. The performance of chosen ML algorithms is validated by field dataset of a 10kWp solar PV power plant in Eastern India region. Furthermore, a complete test-bed framework has been designed for data mining as well as to select appropriate learning models. It also supports feature selection and reduction for dataset to reduce space and time complexity of the learning models. The results demonstrate greater prediction accuracy of around 96% for Stacking and Voting EML models. The proposed work is a generalized one and can be very useful for predicting the performance of large-scale solar PV power plants also.  ( 2 min )
    Inducing Point Allocation for Sparse Gaussian Processes in High-Throughput Bayesian Optimisation. (arXiv:2301.10123v1 [cs.LG])
    Sparse Gaussian Processes are a key component of high-throughput Bayesian Optimisation (BO) loops; however, we show that existing methods for allocating their inducing points severely hamper optimisation performance. By exploiting the quality-diversity decomposition of Determinantal Point Processes, we propose the first inducing point allocation strategy designed specifically for use in BO. Unlike existing methods which seek only to reduce global uncertainty in the objective function, our approach provides the local high-fidelity modelling of promising regions required for precise optimisation. More generally, we demonstrate that our proposed framework provides a flexible way to allocate modelling capacity in sparse models and so is suitable broad range of downstream sequential decision making tasks.  ( 2 min )
    Pex: Memory-efficient Microcontroller Deep Learning through Partial Execution. (arXiv:2211.17246v2 [cs.LG] UPDATED)
    Embedded and IoT devices, largely powered by microcontroller units (MCUs), could be made more intelligent by leveraging on-device deep learning. One of the main challenges of neural network inference on an MCU is the extremely limited amount of read-write on-chip memory (SRAM, < 512 kB). SRAM is consumed by the neural network layer (operator) input and output buffers, which, traditionally, must be in memory (materialised) for an operator to execute. We discuss a novel execution paradigm for microcontroller deep learning, which modifies the execution of neural networks to avoid materialising full buffers in memory, drastically reducing SRAM usage with no computation overhead. This is achieved by exploiting the properties of operators, which can consume/produce a fraction of their input/output at a time. We describe a partial execution compiler, Pex, which produces memory-efficient execution schedules automatically by identifying subgraphs of operators whose execution can be split along the feature ("channel") dimension. Memory usage is reduced further by targeting memory bottlenecks with structured pruning, leading to the co-design of the network architecture and its execution schedule. Our evaluation of image and audio classification models: (a) establishes state-of-the-art performance in low SRAM usage regimes for considered tasks with up to +2.9% accuracy increase; (b) finds that a 4x memory reduction is possible by applying partial execution alone, or up to 10.5x when using the compiler-pruning co-design, while maintaining the classification accuracy compared to prior work; (c) uses the recovered SRAM to process higher resolution inputs instead, increasing accuracy by up to +3.9% on Visual Wake Words.
    Multi-Agent Patrolling with Battery Constraints through Deep Reinforcement Learning. (arXiv:2212.08230v2 [cs.AI] UPDATED)
    Autonomous vehicles are suited for continuous area patrolling problems. However, finding an optimal patrolling strategy can be challenging for many reasons. Firstly, patrolling environments are often complex and can include unknown environmental factors. Secondly, autonomous vehicles can have failures or hardware constraints, such as limited battery life. Importantly, patrolling large areas often requires multiple agents that need to collectively coordinate their actions. In this work, we consider these limitations and propose an approach based on model-free, deep multi-agent reinforcement learning. In this approach, the agents are trained to automatically recharge themselves when required, to support continuous collective patrolling. A distributed homogeneous multi-agent architecture is proposed, where all patrolling agents execute identical policies locally based on their local observations and shared information. This architecture provides a fault-tolerant and robust patrolling system that can tolerate agent failures and allow supplementary agents to be added to replace failed agents or to increase the overall patrol performance. The solution is validated through simulation experiments from multiple perspectives, including the overall patrol performance, the efficiency of battery recharging strategies, and the overall fault tolerance and robustness.
    Dirac signal processing of higher-order topological signals. (arXiv:2301.10137v1 [eess.SP])
    We consider topological signals corresponding to variables supported on nodes, links and triangles of higher-order networks and simplicial complexes. So far such signals are typically processed independently of each other, and algorithms that can enforce a consistent processing of topological signals across different levels are largely lacking. Here we propose Dirac signal processing, an adaptive, unsupervised signal processing algorithm that learns to jointly filter topological signals supported on nodes, links and (filled) triangles of simplicial complexes in a consistent way. The proposed Dirac signal processing algorithm is rooted in algebraic topology and formulated in terms of the discrete Dirac operator which can be interpreted as ``square root" of a higher-order (Hodge) Laplacian matrix acting on nodes, links and triangles of simplicial complexes. We test our algorithms on noisy synthetic data and noisy data of drifters in the ocean and find that the algorithm can learn to efficiently reconstruct the true signals outperforming algorithms based exclusively on the Hodge Laplacian.
    Towards Asteroid Detection in Microlensing Surveys with Deep Learning. (arXiv:2211.02239v2 [astro-ph.EP] UPDATED)
    Asteroids are an indelible part of most astronomical surveys though only a few surveys are dedicated to their detection. Over the years, high cadence microlensing surveys have amassed several terabytes of data while scanning primarily the Galactic Bulge and Magellanic Clouds for microlensing events and thus provide a treasure trove of opportunities for scientific data mining. In particular, numerous asteroids have been observed by visual inspection of selected images. This paper presents novel deep learning-based solutions for the recovery and discovery of asteroids in the microlensing data gathered by the MOA project. Asteroid tracklets can be clearly seen by combining all the observations on a given night and these tracklets inform the structure of the dataset. Known asteroids were identified within these composite images and used for creating the labelled datasets required for supervised learning. Several custom CNN models were developed to identify images with asteroid tracklets. Model ensembling was then employed to reduce the variance in the predictions as well as to improve the generalisation error, achieving a recall of 97.67%. Furthermore, the YOLOv4 object detector was trained to localize asteroid tracklets, achieving a mean Average Precision (mAP) of 90.97%. These trained networks will be applied to 16 years of MOA archival data to find both known and unknown asteroids that have been observed by the survey over the years. The methodologies developed can be adapted for use by other surveys for asteroid recovery and discovery.
    Neural Implicit k-Space for Binning-free Non-Cartesian Cardiac MR Imaging. (arXiv:2212.08479v2 [eess.IV] UPDATED)
    In this work, we propose a novel image reconstruction framework that directly learns a neural implicit representation in k-space for ECG-triggered non-Cartesian Cardiac Magnetic Resonance Imaging (CMR). While existing methods bin acquired data from neighboring time points to reconstruct one phase of the cardiac motion, our framework allows for a continuous, binning-free, and subject-specific k-space representation.We assign a unique coordinate that consists of time, coil index, and frequency domain location to each sampled k-space point. We then learn the subject-specific mapping from these unique coordinates to k-space intensities using a multi-layer perceptron with frequency domain regularization. During inference, we obtain a complete k-space for Cartesian coordinates and an arbitrary temporal resolution. A simple inverse Fourier transform recovers the image, eliminating the need for density compensation and costly non-uniform Fourier transforms for non-Cartesian data. This novel imaging framework was tested on 42 radially sampled datasets from 6 subjects. The proposed method outperforms other techniques qualitatively and quantitatively using data from four and one heartbeat(s) and 30 cardiac phases. Our results for one heartbeat reconstruction of 50 cardiac phases show improved artifact removal and spatio-temporal resolution, leveraging the potential for real-time CMR.
    A Learning Based Hypothesis Test for Harmful Covariate Shift. (arXiv:2212.02742v3 [cs.LG] UPDATED)
    The ability to quickly and accurately identify covariate shift at test time is a critical and often overlooked component of safe machine learning systems deployed in high-risk domains. While methods exist for detecting when predictions should not be made on out-of-distribution test examples, identifying distributional level differences between training and test time can help determine when a model should be removed from the deployment setting and retrained. In this work, we define harmful covariate shift (HCS) as a change in distribution that may weaken the generalization of a predictive model. To detect HCS, we use the discordance between an ensemble of classifiers trained to agree on training data and disagree on test data. We derive a loss function for training this ensemble and show that the disagreement rate and entropy represent powerful discriminative statistics for HCS. Empirically, we demonstrate the ability of our method to detect harmful covariate shift with statistical certainty on a variety of high-dimensional datasets. Across numerous domains and modalities, we show state-of-the-art performance compared to existing methods, particularly when the number of observed test samples is small.
    Tempo: Accelerating Transformer-Based Model Training through Memory Footprint Reduction. (arXiv:2210.10246v2 [cs.LG] UPDATED)
    Training deep learning models can be computationally expensive. Prior works have shown that increasing the batch size can potentially lead to better overall throughput. However, the batch size is frequently limited by the accelerator memory capacity due to the activations/feature maps stored for the training backward pass, as larger batch sizes require larger feature maps to be stored. Transformer-based models, which have recently seen a surge in popularity due to their good performance and applicability to a variety of tasks, have a similar problem. To remedy this issue, we propose Tempo, a new approach to efficiently use accelerator (e.g., GPU) memory resources for training Transformer-based models. Our approach provides drop-in replacements for the GELU, LayerNorm, and Attention layers, reducing the memory usage and ultimately leading to more efficient training. We implement Tempo and evaluate the throughput, memory usage, and accuracy/loss on the BERT Large pre-training task. We demonstrate that Tempo enables up to 2x higher batch sizes and 16% higher training throughput over the state-of-the-art baseline. We also evaluate Tempo on GPT2 and RoBERTa models, showing 19% and 26% speedup over the baseline.
    Quadruple-star systems are not always nested triples: a machine learning approach to dynamical stability. (arXiv:2301.09930v1 [cs.LG])
    The dynamical stability of quadruple-star systems has traditionally been treated as a problem involving two `nested' triples which constitute a quadruple. In this novel study, we employed a machine learning algorithm, the multi-layer perceptron (MLP), to directly classify 2+2 and 3+1 quadruples based on their stability (or long-term boundedness). The training data sets for the classification, comprised of $5\times10^5$ quadruples each, were integrated using the highly accurate direct $N$-body code MSTAR. We also carried out a limited parameter space study of zero-inclination systems to directly compare quadruples to triples. We found that both our quadruple MLP models perform better than a `nested' triple MLP approach, which is especially significant for 3+1 quadruples. The classification accuracies for the 2+2 MLP and 3+1 MLP models are 94% and 93% respectively, while the scores for the `nested' triple approach are 88% and 66% respectively. This is a crucial implication for quadruple population synthesis studies. Our MLP models, which are very simple and almost instantaneous to implement, are available on GitHub, along with Python3 scripts to access them.
    Exploring Effects of Computational Parameter Changes to Image Recognition Systems. (arXiv:2211.00471v3 [cs.LG] UPDATED)
    Image recognition tasks typically use deep learning and require enormous processing power, thus relying on hardware accelerators like GPUs and FPGAs for fast, timely processing. Failure in real-time image recognition tasks can occur due to incorrect mapping on hardware accelerators, which may lead to timing uncertainty and incorrect behavior. Owing to the increased use of image recognition tasks in safety-critical applications like autonomous driving and medical imaging, it is imperative to assess their robustness to changes in the computational environment as parameters like deep learning frameworks, compiler optimizations for code generation, and hardware devices are not regulated with varying impact on model performance and correctness. In this paper we conduct robustness analysis of four popular image recognition models (MobileNetV2, ResNet101V2, DenseNet121 and InceptionV3) with the ImageNet dataset, assessing the impact of the following parameters in the model's computational environment: (1) deep learning frameworks; (2) compiler optimizations; and (3) hardware devices. We report sensitivity of model performance in terms of output label and inference time for changes in each of these environment parameters. We find that output label predictions for all four models are sensitive to choice of deep learning framework (by up to 57%) and insensitive to other parameters. On the other hand, model inference time was affected by all environment parameters with changes in hardware device having the most effect. The extent of effect was not uniform across models.
    Autoencoded sparse Bayesian in-IRT factorization, calibration, and amortized inference for the Work Disability Functional Assessment Battery. (arXiv:2210.10952v2 [stat.ME] UPDATED)
    The Work Disability Functional Assessment Battery (WD-FAB) is a multidimensional item response theory (IRT) instrument designed for assessing work-related mental and physical function based on responses to an item bank. In prior iterations it was developed using traditional means -- linear factorization and null hypothesis statistical testing for item partitioning/selection, and finally, posthoc calibration of disjoint unidimensional IRT models. As a result, the WD-FAB, like many other IRT instruments, is a posthoc model. Its item partitioning, based on exploratory factor analysis, is blind to the final nonlinear IRT model and is not performed in a manner consistent with goodness of fit to the final model. In this manuscript, we develop a Bayesian hierarchical model for self-consistently performing the following simultaneous tasks: scale factorization, item selection, parameter identification, and response scoring. This method uses sparsity-based shrinkage to obviate the linear factorization and null hypothesis statistical tests that are usually required for developing multidimensional IRT models, so that item partitioning is consistent with the ultimate nonlinear factor model. We also analogize our multidimensional IRT model to probabilistic autoencoders, specifying an encoder function that amortizes the inference of ability parameters from item responses. The encoder function is equivalent to the "VBE" step in a stochastic variational Bayesian expectation maximization (VBEM) procedure that we use for approxiamte Bayesian inference on the entire model. We use the method on a sample of WD-FAB item responses and compare the resulting item discriminations to those obtained using the traditional posthoc method.
    Unsupervised Model Selection for Time-series Anomaly Detection. (arXiv:2210.01078v3 [cs.LG] UPDATED)
    Anomaly detection in time-series has a wide range of practical applications. While numerous anomaly detection methods have been proposed in the literature, a recent survey concluded that no single method is the most accurate across various datasets. To make matters worse, anomaly labels are scarce and rarely available in practice. The practical problem of selecting the most accurate model for a given dataset without labels has received little attention in the literature. This paper answers this question i.e. Given an unlabeled dataset and a set of candidate anomaly detectors, how can we select the most accurate model? To this end, we identify three classes of surrogate (unsupervised) metrics, namely, prediction error, model centrality, and performance on injected synthetic anomalies, and show that some metrics are highly correlated with standard supervised anomaly detection performance metrics such as the $F_1$ score, but to varying degrees. We formulate metric combination with multiple imperfect surrogate metrics as a robust rank aggregation problem. We then provide theoretical justification behind the proposed approach. Large-scale experiments on multiple real-world datasets demonstrate that our proposed unsupervised approach is as effective as selecting the most accurate model based on partially labeled data.
    Broken Neural Scaling Laws. (arXiv:2210.14891v5 [cs.LG] UPDATED)
    We present a smoothly broken power law functional form that accurately models and extrapolates the scaling behaviors of deep neural networks (i.e. how the evaluation metric of interest varies as the amount of compute used for training, number of model parameters, training dataset size, or upstream performance varies) for various architectures and for each of various tasks within a large and diverse set of upstream and downstream tasks, in zero-shot, prompted, and fine-tuned settings. This set includes large-scale vision, language, audio, video, diffusion generative modeling, multimodal learning, contrastive learning, AI alignment, robotics, arithmetic, unsupervised/self-supervised learning, and reinforcement learning (single agent and multi-agent). When compared to other functional forms for neural scaling behavior, this functional form yields extrapolations of scaling behavior that are considerably more accurate on this set. Moreover, this functional form accurately models and extrapolates scaling behavior that other functional forms are incapable of expressing such as the non-monotonic transitions present in the scaling behavior of phenomena such as double descent and the delayed, sharp inflection points present in the scaling behavior of tasks such as arithmetic. Lastly, we use this functional form to glean insights about the limit of the predictability of scaling behavior. Code is available at https://github.com/ethancaballero/broken_neural_scaling_laws
    Tracking the industrial growth of modern China with high-resolution panchromatic imagery: A sequential convolutional approach. (arXiv:2301.09620v1 [cs.CV] CROSS LISTED)
    Due to insufficient or difficult to obtain data on development in inaccessible regions, remote sensing data is an important tool for interested stakeholders to collect information on economic growth. To date, no studies have utilized deep learning to estimate industrial growth at the level of individual sites. In this study, we harness high-resolution panchromatic imagery to estimate development over time at 419 industrial sites in the People's Republic of China using a multi-tier computer vision framework. We present two methods for approximating development: (1) structural area coverage estimated through a Mask R-CNN segmentation algorithm, and (2) imputing development directly with visible & infrared radiance from the Visible Infrared Imaging Radiometer Suite (VIIRS). Labels generated from these methods are comparatively evaluated and tested. On a dataset of 2,078 50 cm resolution images spanning 19 years, the results indicate that two dimensions of industrial development can be estimated using high-resolution daytime imagery, including (a) the total square meters of industrial development (average error of 0.021 $\textrm{km}^2$), and (b) the radiance of lights (average error of 9.8 $\mathrm{\frac{nW}{cm^{2}sr}}$). Trend analysis of the techniques reveal estimates from a Mask R-CNN-labeled CNN-LSTM track ground truth measurements most closely. The Mask R-CNN estimates positive growth at every site from the oldest image to the most recent, with an average change of 4,084 $\textrm{m}^2$.
    Self-Supervised Learning Through Efference Copies. (arXiv:2210.09224v2 [cs.LG] UPDATED)
    Self-supervised learning (SSL) methods aim to exploit the abundance of unlabelled data for machine learning (ML), however the underlying principles are often method-specific. An SSL framework derived from biological first principles of embodied learning could unify the various SSL methods, help elucidate learning in the brain, and possibly improve ML. SSL commonly transforms each training datapoint into a pair of views, uses the knowledge of this pairing as a positive (i.e. non-contrastive) self-supervisory sign, and potentially opposes it to unrelated, (i.e. contrastive) negative examples. Here, we show that this type of self-supervision is an incomplete implementation of a concept from neuroscience, the Efference Copy (EC). Specifically, the brain also transforms the environment through efference, i.e. motor commands, however it sends to itself an EC of the full commands, i.e. more than a mere SSL sign. In addition, its action representations are likely egocentric. From such a principled foundation we formally recover and extend SSL methods such as SimCLR, BYOL, and ReLIC under a common theoretical framework, i.e. Self-supervision Through Efference Copies (S-TEC). Empirically, S-TEC restructures meaningfully the within- and between-class representations. This manifests as improvement in recent strong SSL baselines in image classification, segmentation, object detection, and in audio. These results hypothesize a testable positive influence from the brain's motor outputs onto its sensory representations.
    Foresight -- Generative Pretrained Transformer (GPT) for Modelling of Patient Timelines using EHRs. (arXiv:2212.08072v2 [cs.CL] UPDATED)
    Background: Electronic Health Records hold detailed longitudinal information about each patient's health status and general clinical history, a large portion of which is stored within the unstructured text. Existing approaches focus mostly on structured data and a subset of single-domain outcomes. We explore how temporal modelling of patients from free text and structured data, using deep generative transformers can be used to forecast a wide range of future disorders, substances, procedures or findings. Methods: We present Foresight, a novel transformer-based pipeline that uses named entity recognition and linking tools to convert document text into structured, coded concepts, followed by providing probabilistic forecasts for future medical events such as disorders, substances, procedures and findings. We processed the entire free-text portion from three different hospital datasets totalling 811336 patients covering both physical and mental health. Findings: On tests in two UK hospitals (King's College Hospital, South London and Maudsley) and the US MIMIC-III dataset precision@10 0.68, 0.76 and 0.88 was achieved for forecasting the next disorder in a patient timeline, while precision@10 of 0.80, 0.81 and 0.91 was achieved for forecasting the next biomedical concept. Foresight was also validated on 34 synthetic patient timelines by five clinicians and achieved relevancy of 97% for the top forecasted candidate disorder. As a generative model, it can forecast follow-on biomedical concepts for as many steps as required. Interpretation: Foresight is a general-purpose model for biomedical concept modelling that can be used for real-world risk forecasting, virtual trials and clinical research to study the progression of disorders, simulate interventions and counterfactuals, and educational purposes.
    ESTAS: Effective and Stable Trojan Attacks in Self-supervised Encoders with One Target Unlabelled Sample. (arXiv:2211.10908v2 [cs.CV] UPDATED)
    Emerging self-supervised learning (SSL) has become a popular image representation encoding method to obviate the reliance on labeled data and learn rich representations from large-scale, ubiquitous unlabelled data. Then one can train a downstream classifier on top of the pre-trained SSL image encoder with few or no labeled downstream data. Although extensive works show that SSL has achieved remarkable and competitive performance on different downstream tasks, its security concerns, e.g, Trojan attacks in SSL encoders, are still not well-studied. In this work, we present a novel Trojan Attack method, denoted by ESTAS, that can enable an effective and stable attack in SSL encoders with only one target unlabeled sample. In particular, we propose consistent trigger poisoning and cascade optimization in ESTAS to improve attack efficacy and model accuracy, and eliminate the expensive target-class data sample extraction from large-scale disordered unlabelled data. Our substantial experiments on multiple datasets show that ESTAS stably achieves > 99% attacks success rate (ASR) with one target-class sample. Compared to prior works, ESTAS attains > 30% ASR increase and > 8.3% accuracy improvement on average.
    Visual Simulation Software Demonstration for Quantum Multi-Drone Reinforcement Learning. (arXiv:2211.15375v2 [quant-ph] UPDATED)
    Quantum computing (QC) has received a lot of attention according to its light training parameter numbers and computational speeds by qubits. Moreover, various researchers have tried to enable quantum machine learning (QML) using QC, where there are also multifarious efforts to use QC to implement quantum multi-agent reinforcement learning (QMARL). Existing classical multi-agent reinforcement learning (MARL) using neural network features non-stationarity and uncertain properties due to its large number of parameters. Therefore, this paper presents a visual simulation software framework for a novel QMARL algorithm to control autonomous multi-drone systems to take advantage of QC. Our proposed QMARL framework accomplishes reasonable reward convergence and service quality performance with fewer trainable parameters than the classical MARL. Furthermore, QMARL shows more stable training results than existing MARL algorithms. Lastly, our proposed visual simulation software allows us to analyze the agents' training process and results.
    Green, Quantized Federated Learning over Wireless Networks: An Energy-Efficient Design. (arXiv:2207.09387v2 [cs.LG] UPDATED)
    In this paper, a green-quantized FL framework, which represents data with a finite precision level in both local training and uplink transmission, is proposed. Here, the finite precision level is captured through the use of quantized neural networks (QNNs) that quantize weights and activations in fixed-precision format. In the considered FL model, each device trains its QNN and transmits a quantized training result to the base station. Energy models for the local training and the transmission with quantization are rigorously derived. To minimize the energy consumption and the number of communication rounds simultaneously, a multi-objective optimization problem is formulated with respect to the number of local iterations, the number of selected devices, and the precision levels for both local training and transmission while ensuring convergence under a target accuracy constraint. To solve this problem, the convergence rate of the proposed FL system is analytically derived with respect to the system control variables. Then, the Pareto boundary of the problem is characterized to provide efficient solutions using the normal boundary inspection method. Design insights on balancing the tradeoff between the two objectives while achieving a target accuracy are drawn from using the Nash bargaining solution and analyzing the derived convergence rate. Simulation results show that the proposed FL framework can reduce energy consumption until convergence by up to 70\% compared to a baseline FL algorithm that represents data with full precision without damaging the convergence rate.
    SIAN: Style-Guided Instance-Adaptive Normalization for Multi-Organ Histopathology Image Synthesis. (arXiv:2209.02412v2 [eess.IV] UPDATED)
    Existing deep neural networks for histopathology image synthesis cannot generate image styles that align with different organs, and cannot produce accurate boundaries of clustered nuclei. To address these issues, we propose a style-guided instance-adaptive normalization (SIAN) approach to synthesize realistic color distributions and textures for histopathology images from different organs. SIAN contains four phases, semantization, stylization, instantiation, and modulation. The first two phases synthesize image semantics and styles by using semantic maps and learned image style vectors. The instantiation module integrates geometrical and topological information and generates accurate nuclei boundaries. We validate the proposed approach on a multiple-organ dataset, Extensive experimental results demonstrate that the proposed method generates more realistic histopathology images than four state-of-the-art approaches for five organs. By incorporating synthetic images from the proposed approach to model training, an instance segmentation network can achieve state-of-the-art performance.
    RAIN: RegulArization on Input and Network for Black-Box Domain Adaptation. (arXiv:2208.10531v2 [cs.CV] UPDATED)
    Source-Free domain adaptation transits the source-trained model towards target domain without exposing the source data, trying to dispel these concerns about data privacy and security. However, this paradigm is still at risk of data leakage due to adversarial attacks on the source model. Hence, the Black-Box setting only allows to use the outputs of source model, but still suffers from overfitting on the source domain more severely due to source model's unseen weights. In this paper, we propose a novel approach named RAIN (RegulArization on Input and Network) for Black-Box domain adaptation from both input-level and network-level regularization. For the input-level, we design a new data augmentation technique as Phase MixUp, which highlights task-relevant objects in the interpolations, thus enhancing input-level regularization and class consistency for target models. For network-level, we develop a Subnetwork Distillation mechanism to transfer knowledge from the target subnetwork to the full target network via knowledge distillation, which thus alleviates overfitting on the source domain by learning diverse target representations. Extensive experiments show that our method achieves state-of-the-art performance on several cross-domain benchmarks under both single- and multi-source black-box domain adaptation.
    A computational framework for physics-informed symbolic regression with straightforward integration of domain knowledge. (arXiv:2209.06257v3 [cs.LG] UPDATED)
    Discovering a meaningful symbolic expression that explains experimental data is a fundamental challenge in many scientific fields. We present a novel, open-source computational framework called Scientist-Machine Equation Detector (SciMED), which integrates scientific discipline wisdom in a scientist-in-the-loop approach, with state-of-the-art symbolic regression (SR) methods. SciMED combines a wrapper selection method, that is based on a genetic algorithm, with automatic machine learning and two levels of SR methods. We test SciMED on five configurations of a settling sphere, with and without aerodynamic non-linear drag force, and with excessive noise in the measurements. We show that SciMED is sufficiently robust to discover the correct physically meaningful symbolic expressions from the data, and demonstrate how the integration of domain knowledge enhances its performance. Our results indicate better performance on these tasks than the state-of-the-art SR software packages , even in cases where no knowledge is integrated. Moreover, we demonstrate how SciMED can alert the user about possible missing features, unlike the majority of current SR systems.
    Measuring Fairness Under Unawareness of Sensitive Attributes: A Quantification-Based Approach. (arXiv:2109.08549v4 [cs.CY] UPDATED)
    Algorithms and models are increasingly deployed to inform decisions about people, inevitably affecting their lives. As a consequence, those in charge of developing these models must carefully evaluate their impact on different groups of people and favour group fairness, that is, ensure that groups determined by sensitive demographic attributes, such as race or sex, are not treated unjustly. To achieve this goal, the availability (awareness) of these demographic attributes to those evaluating the impact of these models is fundamental. Unfortunately, collecting and storing these attributes is often in conflict with industry practices and legislation on data minimisation and privacy. For this reason, it can be hard to measure the group fairness of trained models, even from within the companies developing them. In this work, we tackle the problem of measuring group fairness under unawareness of sensitive attributes, by using techniques from quantification, a supervised learning task concerned with directly providing group-level prevalence estimates (rather than individual-level class labels). We show that quantification approaches are particularly suited to tackle the fairness-under-unawareness problem, as they are robust to inevitable distribution shifts while at the same time decoupling the (desirable) objective of measuring group fairness from the (undesirable) side effect of allowing the inference of sensitive attributes of individuals. More in detail, we show that fairness under unawareness can be cast as a quantification problem and solved with proven methods from the quantification literature. We show that these methods outperform previous approaches to measure demographic parity in five experimental protocols, corresponding to important challenges that complicate the estimation of classifier fairness under unawareness.
    Learning to Counter: Stochastic Feature-based Learning for Diverse Counterfactual Explanations. (arXiv:2209.13446v2 [cs.AI] UPDATED)
    Interpretable machine learning seeks to understand the reasoning process of complex black-box systems that are long notorious for lack of explainability. One flourishing approach is through counterfactual explanations, which provide suggestions on what a user can do to alter an outcome. Not only must a counterfactual example counter the original prediction from the black-box classifier but it should also satisfy various constraints for practical applications. Diversity is one of the critical constraints that however remains less discussed. While diverse counterfactuals are ideal, it is computationally challenging to simultaneously address some other constraints. Furthermore, there is a growing privacy concern over the released counterfactual data. To this end, we propose a feature-based learning framework that effectively handles the counterfactual constraints and contributes itself to the limited pool of private explanation models. We demonstrate the flexibility and effectiveness of our method in generating diverse counterfactuals of actionability and plausibility. Our counterfactual engine is more efficient than counterparts of the same capacity while yielding the lowest re-identification risks.
    Incorporating functional summary information in Bayesian neural networks using a Dirichlet process likelihood approach. (arXiv:2207.01234v2 [cs.LG] UPDATED)
    Bayesian neural networks (BNNs) can account for both aleatoric and epistemic uncertainty. However, in BNNs the priors are often specified over the weights which rarely reflects true prior knowledge in large and complex neural network architectures. We present a simple approach to incorporate prior knowledge in BNNs based on external summary information about the predicted classification probabilities for a given dataset. The available summary information is incorporated as augmented data and modeled with a Dirichlet process, and we derive the corresponding \emph{Summary Evidence Lower BOund}. The approach is founded on Bayesian principles, and all hyperparameters have a proper probabilistic interpretation. We show how the method can inform the model about task difficulty and class imbalance. Extensive experiments show that, with negligible computational overhead, our method parallels and in many cases outperforms popular alternatives in accuracy, uncertainty calibration, and robustness against corruptions with both balanced and imbalanced data.
    Can large language models reason about medical questions?. (arXiv:2207.08143v3 [cs.CL] UPDATED)
    Although large language models (LLMs) often produce impressive outputs, it remains unclear how they perform in real-world scenarios requiring strong reasoning skills and expert domain knowledge. We set out to investigate whether GPT-3.5 (Codex and InstructGPT) can be applied to answer and reason about difficult real-world-based questions. We utilize two multiple-choice medical exam questions (USMLE and MedMCQA) and a medical reading comprehension dataset (PubMedQA). We investigate multiple prompting scenarios: Chain-of-Thought (CoT, think step-by-step), zero- and few-shot (prepending the question with question-answer exemplars) and retrieval augmentation (injecting Wikipedia passages into the prompt). For a subset of the USMLE questions, a medical expert reviewed and annotated the model's CoT. We found that InstructGPT can often read, reason and recall expert knowledge. Failure are primarily due to lack of knowledge and reasoning errors and trivial guessing heuristics are observed, e.g.\ too often predicting labels A and D on USMLE. Sampling and combining many completions overcome some of these limitations. Using 100 samples, Codex 5-shot CoT not only gives close to well-calibrated predictive probability but also achieves human-level performances on the three datasets. USMLE: 60.2%, MedMCQA: 62.7% and PubMedQA: 78.2%.
    To Trust or Not To Trust Prediction Scores for Membership Inference Attacks. (arXiv:2111.09076v3 [cs.LG] UPDATED)
    Membership inference attacks (MIAs) aim to determine whether a specific sample was used to train a predictive model. Knowing this may indeed lead to a privacy breach. Most MIAs, however, make use of the model's prediction scores - the probability of each output given some input - following the intuition that the trained model tends to behave differently on its training data. We argue that this is a fallacy for many modern deep network architectures. Consequently, MIAs will miserably fail since overconfidence leads to high false-positive rates not only on known domains but also on out-of-distribution data and implicitly acts as a defense against MIAs. Specifically, using generative adversarial networks, we are able to produce a potentially infinite number of samples falsely classified as part of the training data. In other words, the threat of MIAs is overestimated, and less information is leaked than previously assumed. Moreover, there is actually a trade-off between the overconfidence of models and their susceptibility to MIAs: the more classifiers know when they do not know, making low confidence predictions, the more they reveal the training data.
    Topogivity: A Machine-Learned Chemical Rule for Discovering Topological Materials. (arXiv:2202.05255v3 [cond-mat.mtrl-sci] UPDATED)
    Topological materials present unconventional electronic properties that make them attractive for both basic science and next-generation technological applications. The majority of currently known topological materials have been discovered using methods that involve symmetry-based analysis of the quantum wavefunction. Here we use machine learning to develop a simple-to-use heuristic chemical rule that diagnoses with a high accuracy whether a material is topological using only its chemical formula. This heuristic rule is based on a notion that we term topogivity, a machine-learned numerical value for each element that loosely captures its tendency to form topological materials. We next implement a high-throughput procedure for discovering topological materials based on the heuristic topogivity-rule prediction followed by ab initio validation. This way, we discover new topological materials that are not diagnosable using symmetry indicators, including several that may be promising for experimental observation.
    Gaze-based Object Detection in the Wild. (arXiv:2203.15651v2 [cs.RO] UPDATED)
    In human-robot collaboration, one challenging task is to teach a robot new yet unknown objects enabling it to interact with them. Thereby, gaze can contain valuable information. We investigate if it is possible to detect objects (object or no object) merely from gaze data and determine their bounding box parameters. For this purpose, we explore different sizes of temporal windows, which serve as a basis for the computation of heatmaps, i.e., the spatial distribution of the gaze data. Additionally, we analyze different grid sizes of these heatmaps, and demonstrate the functionality in a proof of concept using different machine learning techniques. Our method is characterized by its speed and resource efficiency compared to conventional object detectors. In order to generate the required data, we conducted a study with five subjects who could move freely and thus, turn towards arbitrary objects. This way, we chose a scenario for our data collection that is as realistic as possible. Since the subjects move while facing objects, the heatmaps also contain gaze data trajectories, complicating the detection and parameter regression. We make our data set publicly available to the research community for download.
    Mixed Effects Random Forests for Personalised Predictions of Clinical Depression Severity. (arXiv:2301.09815v1 [cs.LG])
    This work demonstrates how mixed effects random forests enable accurate predictions of depression severity using multimodal physiological and digital activity data collected from an 8-week study involving 31 patients with major depressive disorder. We show that mixed effects random forests outperform standard random forests and personal average baselines when predicting clinical Hamilton Depression Rating Scale scores (HDRS_17). Compared to the latter baseline, accuracy is significantly improved for each patient by an average of 0.199-0.276 in terms of mean absolute error (p<0.05). This is noteworthy as these simple baselines frequently outperform machine learning methods in mental health prediction tasks. We suggest that this improved performance results from the ability of the mixed effects random forest to personalise model parameters to individuals in the dataset. However, we find that these improvements pertain exclusively to scenarios where labelled patient data are available to the model at training time. Investigating methods that improve accuracy when generalising to new patients is left as important future work.
    Integrating Reward Maximization and Population Estimation: Sequential Decision-Making for Internal Revenue Service Audit Selection. (arXiv:2204.11910v3 [cs.LG] UPDATED)
    We introduce a new setting, optimize-and-estimate structured bandits. Here, a policy must select a batch of arms, each characterized by its own context, that would allow it to both maximize reward and maintain an accurate (ideally unbiased) population estimate of the reward. This setting is inherent to many public and private sector applications and often requires handling delayed feedback, small data, and distribution shifts. We demonstrate its importance on real data from the United States Internal Revenue Service (IRS). The IRS performs yearly audits of the tax base. Two of its most important objectives are to identify suspected misreporting and to estimate the "tax gap" -- the global difference between the amount paid and true amount owed. Based on a unique collaboration with the IRS, we cast these two processes as a unified optimize-and-estimate structured bandit. We analyze optimize-and-estimate approaches to the IRS problem and propose a novel mechanism for unbiased population estimation that achieves rewards comparable to baseline approaches. This approach has the potential to improve audit efficacy, while maintaining policy-relevant estimates of the tax gap. This has important social consequences given that the current tax gap is estimated at nearly half a trillion dollars. We suggest that this problem setting is fertile ground for further research and we highlight its interesting challenges. The results of this and related research are currently being incorporated into the continual improvement of the IRS audit selection methods.
    Efficient Planning in a Compact Latent Action Space. (arXiv:2208.10291v3 [cs.LG] UPDATED)
    Planning-based reinforcement learning has shown strong performance in tasks in discrete and low-dimensional continuous action spaces. However, planning usually brings significant computational overhead for decision-making, and scaling such methods to high-dimensional action spaces remains challenging. To advance efficient planning for high-dimensional continuous control, we propose Trajectory Autoencoding Planner (TAP), which learns low-dimensional latent action codes with a state-conditional VQ-VAE. The decoder of the VQ-VAE thus serves as a novel dynamics model that takes latent actions and current state as input and reconstructs long-horizon trajectories. During inference time, given a starting state, TAP searches over discrete latent actions to find trajectories that have both high probability under the training distribution and high predicted cumulative reward. Empirical evaluation in the offline RL setting demonstrates low decision latency which is indifferent to the growing raw action dimensionality. For Adroit robotic hand manipulation tasks with high-dimensional continuous action space, TAP surpasses existing model-based methods by a large margin and also beats strong model-free actor-critic baselines.
    Context-specific kernel-based hidden Markov model for time series analysis. (arXiv:2301.09870v1 [stat.ML])
    Traditional hidden Markov models have been a useful tool to understand and model stochastic dynamic linear data; in the case of non-Gaussian data or not linear in mean data, models such as mixture of Gaussian hidden Markov models suffer from the computation of precision matrices and have a lot of unnecessary parameters. As a consequence, such models often perform better when it is assumed that all variables are independent, a hypothesis that may be unrealistic. Hidden Markov models based on kernel density estimation is also capable of modeling non Gaussian data, but they assume independence between variables. In this article, we introduce a new hidden Markov model based on kernel density estimation, which is capable of introducing kernel dependencies using context-specific Bayesian networks. The proposed model is described, together with a learning algorithm based on the expectation-maximization algorithm. Additionally, the model is compared with related HMMs using synthetic and real data. From the results, the benefits in likelihood and classification accuracy from the proposed model are quantified and analyzed.
    A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range. (arXiv:2203.13273v5 [cs.LG] UPDATED)
    We make contributions towards improving adaptive-optimizer performance. Our improvements are based on suppression of the range of adaptive stepsizes in the AdaBelief optimizer. Firstly, we show that the particular placement of the parameter epsilon within the update expressions of AdaBelief reduces the range of the adaptive stepsizes, making AdaBelief closer to SGD with momentum. Secondly, we extend AdaBelief by further suppressing the range of the adaptive stepsizes. To achieve the above goal, we perform mutual layerwise vector projections between the gradient g_t and its first momentum m_t before using them to estimate the second momentum. The new optimization method is referred to as Aida. Thirdly, extensive experimental results show that Aida outperforms nine optimizers when training transformers and LSTMs for NLP, and VGG and ResNet for image classification over CIAF10 and CIFAR100 while matching the best performance of the nine methods when training WGAN-GP models for image generation tasks. Furthermore, Aida produces higher validation accuracies than AdaBelief for training ResNet18 over ImageNet. Code is available at this URL
    Planckian Jitter: countering the color-crippling effects of color jitter on self-supervised training. (arXiv:2202.07993v2 [cs.CV] UPDATED)
    Several recent works on self-supervised learning are trained by mapping different augmentations of the same image to the same feature representation. The data augmentations used are of crucial importance to the quality of learned feature representations. In this paper, we analyze how the color jitter traditionally used in data augmentation negatively impacts the quality of the color features in learned feature representations. To address this problem, we propose a more realistic, physics-based color data augmentation - which we call Planckian Jitter - that creates realistic variations in chromaticity and produces a model robust to illumination changes that can be commonly observed in real life, while maintaining the ability to discriminate image content based on color information. Experiments confirm that such a representation is complementary to the representations learned with the currently-used color jitter augmentation and that a simple concatenation leads to significant performance gains on a wide range of downstream datasets. In addition, we present a color sensitivity analysis that documents the impact of different training methods on model neurons and shows that the performance of the learned features is robust with respect to illuminant variations.
    Model-Agnostic Confidence Intervals for Feature Importance: A Fast and Powerful Approach Using Minipatch Ensembles. (arXiv:2206.02088v2 [stat.ML] UPDATED)
    To promote new scientific discoveries from complex data sets, feature importance inference has been a long-standing statistical problem. Instead of testing for parameters that are only interpretable for specific models, there has been increasing interest in model-agnostic methods, often in the form of feature occlusion or leave-one-covariate-out (LOCO) inference. Existing approaches often make distributional assumptions, which can be difficult to verify in practice, or require model refitting and data splitting, which are computationally intensive and lead to losses in power. In this work, we develop a novel, mostly model-agnostic and distribution-free inference framework for feature importance that is computationally efficient and statistically powerful. Our approach is fast as we avoid model refitting by leveraging a form of random observation and feature subsampling called minipatch ensembles; this approach also improves statistical power by avoiding data splitting. Our framework can be applied on tabular data and with any machine learning algorithm, together with minipatch ensembles, for regression and classification tasks. Despite the dependencies induced by using minipatch ensembles, we show that our approach provides asymptotic coverage for the feature importance score of any model under mild assumptions. Finally, our same procedure can also be leveraged to provide valid confidence intervals for predictions, hence providing fast, simultaneous quantification of the uncertainty of both predictions and feature importance. We validate our intervals on a series of synthetic and real data examples, including non-linear settings, showing that our approach detects the correct important features and exhibits many computational and statistical advantages over existing methods.
    MTTN: Multi-Pair Text to Text Narratives for Prompt Generation. (arXiv:2301.10172v1 [cs.CL])
    The explosive popularity of diffusion models[ 1][ 2][ 3 ] has provided a huge stage for further development in generative-text modelling. As prompt based models are very nuanced, such that a carefully generated prompt can produce truely breath taking images, on the contrary producing powerful or even meaningful prompt is a hit or a miss. To lavish on this we have introduced a large scale derived and synthesized dataset built with on real prompts and indexed with popular image-text datasets like MS-COCO[4 ], Flickr[ 5], etc. We have also introduced staging for these sentences that sequentially reduce the context and increase the complexity, that will further strengthen the output because of the complex annotations that are being created. MTTN consists of over 2.4M sentences that are divided over 5 stages creating a combination amounting to over 12M pairs, along with a vocab size of consisting more than 300 thousands unique words that creates an abundance of variations. The original 2.4M million pairs are broken down in such a manner that it produces a true scenario of internet lingo that is used globally thereby heightening the robustness of the dataset, and any model trained on it.
    RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving. (arXiv:2301.10222v1 [cs.CV])
    Casting semantic segmentation of outdoor LiDAR point clouds as a 2D problem, e.g., via range projection, is an effective and popular approach. These projection-based methods usually benefit from fast computations and, when combined with techniques which use other point cloud representations, achieve state-of-the-art results. Today, projection-based methods leverage 2D CNNs but recent advances in computer vision show that vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks. In this work, we question if projection-based methods for 3D semantic segmentation can benefit from these latest improvements on ViTs. We answer positively but only after combining them with three key ingredients: (a) ViTs are notoriously hard to train and require a lot of training data to learn powerful representations. By preserving the same backbone architecture as for RGB images, we can exploit the knowledge from long training on large image collections that are much cheaper to acquire and annotate than point clouds. We reach our best results with pre-trained ViTs on large image datasets. (b) We compensate ViTs' lack of inductive bias by substituting a tailored convolutional stem for the classical linear embedding layer. (c) We refine pixel-wise predictions with a convolutional decoder and a skip connection from the convolutional stem to combine low-level but fine-grained features of the the convolutional stem with the high-level but coarse predictions of the ViT encoder. With these ingredients, we show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and SemanticKITTI. We provide the implementation code at https://github.com/valeoai/rangevit.
    A Watermark for Large Language Models. (arXiv:2301.10226v1 [cs.LG])
    Potential harms of large language models can be mitigated by watermarking model output, i.e., embedding signals into generated text that are invisible to humans but algorithmically detectable from a short span of tokens. We propose a watermarking framework for proprietary language models. The watermark can be embedded with negligible impact on text quality, and can be detected using an efficient open-source algorithm without access to the language model API or parameters. The watermark works by selecting a randomized set of whitelist tokens before a word is generated, and then softly promoting use of whitelist tokens during sampling. We propose a statistical test for detecting the watermark with interpretable p-values, and derive an information-theoretic framework for analyzing the sensitivity of the watermark. We test the watermark using a multi-billion parameter model from the Open Pretrained Transformer (OPT) family, and discuss robustness and security.
    HTMOT : Hierarchical Topic Modelling Over Time. (arXiv:2112.03104v2 [cs.IR] UPDATED)
    Over the years, topic models have provided an efficient way of extracting insights from text. However, while many models have been proposed, none are able to model topic temporality and hierarchy jointly. Modelling time provide more precise topics by separating lexically close but temporally distinct topics while modelling hierarchy provides a more detailed view of the content of a document corpus. In this study, we therefore propose a novel method, HTMOT, to perform Hierarchical Topic Modelling Over Time. We train HTMOT using a new implementation of Gibbs sampling, which is more efficient. Specifically, we show that only applying time modelling to deep sub-topics provides a way to extract specific stories or events while high level topics extract larger themes in the corpus. Our results show that our training procedure is fast and can extract accurate high-level topics and temporally precise sub-topics. We measured our model's performance using the Word Intrusion task and outlined some limitations of this evaluation method, especially for hierarchical models. As a case study, we focused on the various developments in the space industry in 2020.
    Leveraging Vision-Language Models for Granular Market Change Prediction. (arXiv:2301.10166v1 [q-fin.ST])
    Predicting future direction of stock markets using the historical data has been a fundamental component in financial forecasting. This historical data contains the information of a stock in each specific time span, such as the opening, closing, lowest, and highest price. Leveraging this data, the future direction of the market is commonly predicted using various time-series models such as Long-Short Term Memory networks. This work proposes modeling and predicting market movements with a fundamentally new approach, namely by utilizing image and byte-based number representation of the stock data processed with the recently introduced Vision-Language models. We conduct a large set of experiments on the hourly stock data of the German share index and evaluate various architectures on stock price prediction using historical stock data. We conduct a comprehensive evaluation of the results with various metrics to accurately depict the actual performance of various approaches. Our evaluation results show that our novel approach based on representation of stock data as text (bytes) and image significantly outperforms strong deep learning-based baselines.
    WEASEL 2.0 -- A Random Dilated Dictionary Transform for Fast, Accurate and Memory Constrained Time Series Classification. (arXiv:2301.10194v1 [cs.LG])
    A time series is a sequence of sequentially ordered real values in time. Time series classification (TSC) is the task of assigning a time series to one of a set of predefined classes, usually based on a model learned from examples. Dictionary-based methods for TSC rely on counting the frequency of certain patterns in time series and are important components of the currently most accurate TSC ensembles. One of the early dictionary-based methods was WEASEL, which at its time achieved SotA results while also being very fast. However, it is outperformed both in terms of speed and accuracy by other methods. Furthermore, its design leads to an unpredictably large memory footprint, making it inapplicable for many applications. In this paper, we present WEASEL 2.0, a complete overhaul of WEASEL based on two recent advancements in TSC: Dilation and ensembling of randomized hyper-parameter settings. These two techniques allow WEASEL 2.0 to work with a fixed-size memory footprint while at the same time improving accuracy. Compared to 15 other SotA methods on the UCR benchmark set, WEASEL 2.0 is significantly more accurate than other dictionary methods and not significantly worse than the currently best methods. Actually, it achieves the highest median accuracy over all data sets, and it performs best in 5 out of 12 problem classes. We thus believe that WEASEL 2.0 is a viable alternative for current TSC and also a potentially interesting input for future ensembles.
    A Wholistic View of Continual Learning with Deep Neural Networks: Forgotten Lessons and the Bridge to Active and Open World Learning. (arXiv:2009.01797v3 [cs.LG] UPDATED)
    Current deep learning methods are regarded as favorable if they empirically perform well on dedicated test sets. This mentality is seamlessly reflected in the resurfacing area of continual learning, where consecutively arriving data is investigated. The core challenge is framed as protecting previously acquired representations from being catastrophically forgotten. However, comparison of individual methods is nevertheless performed in isolation from the real world by monitoring accumulated benchmark test set performance. The closed world assumption remains predominant, i.e. models are evaluated on data that is guaranteed to originate from the same distribution as used for training. This poses a massive challenge as neural networks are well known to provide overconfident false predictions on unknown and corrupted instances. In this work we critically survey the literature and argue that notable lessons from open set recognition, identifying unknown examples outside of the observed set, and the adjacent field of active learning, querying data to maximize the expected performance gain, are frequently overlooked in the deep learning era. Hence, we propose a consolidated view to bridge continual learning, active learning and open set recognition in deep neural networks. Finally, the established synergies are supported empirically, showing joint improvement in alleviating catastrophic forgetting, querying data, selecting task orders, while exhibiting robust open world application.
    Fine-grained Early Frequency Attention for Deep Speaker Representation Learning. (arXiv:2009.01822v2 [eess.AS] UPDATED)
    Deep learning techniques have considerably improved speech processing in recent years. Speaker representations extracted by deep learning models are being used in a wide range of tasks such as speaker recognition and speech emotion recognition. Attention mechanisms have started to play an important role in improving deep learning models in the field of speech processing. Nonetheless, despite the fact that important speaker-related information can be embedded in individual frequency-bins of the input spectral representations, current attention models are unable to attend to fine-grained information items in spectral representations. In this paper we propose Fine-grained Early Frequency Attention (FEFA) for speaker representation learning. Our model is a simple and lightweight model that can be integrated into various CNN pipelines and is capable of focusing on information items as small as frequency-bins. We evaluate the proposed model on three tasks of speaker recognition, speech emotion recognition, and spoken digit recognition. We use Three widely used public datasets, namely VoxCeleb, IEMOCAP, and Free Spoken Digit Dataset for our experiments. We attach FEFA to several prominent deep learning models and evaluate its impact on the final performance. We also compare our work with other related works in the area. Our experiments show that by adding FEFA to different CNN architectures, performance is consistently improved by substantial margins, and the models equipped with FEFA outperform all the other attentive models. We also test our model against different levels of added noise showing improvements in robustness and less sensitivity compared to the backbone networks.
    Improving Open-Set Semi-Supervised Learning with Self-Supervision. (arXiv:2301.10127v1 [cs.LG])
    Open-set semi-supervised learning (OSSL) is a realistic setting of semi-supervised learning where the unlabeled training set contains classes that are not present in the labeled set. Many existing OSSL methods assume that these out-of-distribution data are harmful and put effort into excluding data from unknown classes from the training objective. In contrast, we propose an OSSL framework that facilitates learning from all unlabeled data through self-supervision. Additionally, we utilize an energy-based score to accurately recognize data belonging to the known classes, making our method well-suited for handling uncurated data in deployment. We show through extensive experimental evaluations on several datasets that our method shows overall unmatched robustness and performance in terms of closed-set accuracy and open-set recognition compared with state-of-the-art for OSSL. Our code will be released upon publication.  ( 2 min )
    Topological Understanding of Neural Networks, a survey. (arXiv:2301.09742v1 [cs.LG])
    We look at the internal structure of neural networks which is usually treated as a black box. The easiest and the most comprehensible thing to do is to look at a binary classification and try to understand the approach a neural network takes. We review the significance of different activation functions, types of network architectures associated to them, and some empirical data. We find some interesting observations and a possibility to build upon the ideas to verify the process for real datasets. We suggest some possible experiments to look forward to in three different directions.
    A predictive physics-aware hybrid reduced order model for reacting flows. (arXiv:2301.09860v1 [cs.LG])
    In this work, a new hybrid predictive Reduced Order Model (ROM) is proposed to solve reacting flow problems. This algorithm is based on a dimensionality reduction using Proper Orthogonal Decomposition (POD) combined with deep learning architectures. The number of degrees of freedom is reduced from thousands of temporal points to a few POD modes with their corresponding temporal coefficients. Two different deep learning architectures have been tested to predict the temporal coefficients, based on recursive (RNN) and convolutional (CNN) neural networks. From each architecture, different models have been created to understand the behavior of each parameter of the neural network. Results show that these architectures are able to predict the temporal coefficients of the POD modes, as well as the whole snapshots. The RNN shows lower prediction error for all the variables analyzed. The model was also found capable of predicting more complex simulations showing transfer learning capabilities.
    Towards Modular Machine Learning Solution Development: Benefits and Trade-offs. (arXiv:2301.09753v1 [cs.LG])
    Machine learning technologies have demonstrated immense capabilities in various domains. They play a key role in the success of modern businesses. However, adoption of machine learning technologies has a lot of untouched potential. Cost of developing custom machine learning solutions that solve unique business problems is a major inhibitor to far-reaching adoption of machine learning technologies. We recognize that the monolithic nature prevalent in today's machine learning applications stands in the way of efficient and cost effective customized machine learning solution development. In this work we explore the benefits of modular machine learning solutions and discuss how modular machine learning solutions can overcome some of the major solution engineering limitations of monolithic machine learning solutions. We analyze the trade-offs between modular and monolithic machine learning solutions through three deep learning problems; one text based and the two image based. Our experimental results show that modular machine learning solutions have a promising potential to reap the solution engineering advantages of modularity while gaining performance and data advantages in a way the monolithic machine learning solutions do not permit.
    Upper and Lower Bounds on the Performance of Kernel PCA. (arXiv:2012.10369v2 [cs.LG] UPDATED)
    Principal Component Analysis (PCA) is a popular method for dimension reduction and has attracted an unfailing interest for decades. More recently, kernel PCA (KPCA) has emerged as an extension of PCA but, despite its use in practice, a sound theoretical understanding of KPCA is missing. We contribute several lower and upper bounds on the efficiency of KPCA, involving the empirical eigenvalues of the kernel Gram matrix and new quantities involving a notion of variance. These bounds show how much information is captured by KPCA on average and contribute a better theoretical understanding of its efficiency. We demonstrate that fast convergence rates are achievable for a widely used class of kernels and we highlight the importance of some desirable properties of datasets to ensure KPCA efficiency.
    Double Matching Under Complementary Preferences. (arXiv:2301.10230v1 [stat.ML])
    In this paper, we propose a new algorithm for addressing the problem of matching markets with complementary preferences, where agents' preferences are unknown a priori and must be learned from data. The presence of complementary preferences can lead to instability in the matching process, making this problem challenging to solve. To overcome this challenge, we formulate the problem as a bandit learning framework and propose the Multi-agent Multi-type Thompson Sampling (MMTS) algorithm. The algorithm combines the strengths of Thompson Sampling for exploration with a double matching technique to achieve a stable matching outcome. Our theoretical analysis demonstrates the effectiveness of MMTS as it is able to achieve stability at every matching step, satisfies the incentive-compatibility property, and has a sublinear Bayesian regret over time. Our approach provides a useful method for addressing complementary preferences in real-world scenarios.
    A Robust Hypothesis Test for Tree Ensemble Pruning. (arXiv:2301.10115v1 [cs.LG])
    Gradient boosted decision trees are some of the most popular algorithms in applied machine learning. They are a flexible and powerful tool that can robustly fit to any tabular dataset in a scalable and computationally efficient way. One of the most critical parameters to tune when fitting these models are the various penalty terms used to distinguish signal from noise in the current model. These penalties are effective in practice, but are lacking in robust theoretical justifications. In this paper we develop and present a novel theoretically justified hypothesis test of split quality for gradient boosted tree ensembles and demonstrate that using this method instead of the common penalty terms leads to a significant reduction in out of sample loss. Additionally, this method provides a theoretically well-justified stopping condition for the tree growing algorithm. We also present several innovative extensions to the method, opening the door for a wide variety of novel tree pruning algorithms.  ( 2 min )
    Interpretable Tsetlin Machine-based Premature Ventricular Contraction Identification. (arXiv:2301.10181v1 [eess.SP])
    Neural network-based models have found wide use in automatic long-term electrocardiogram (ECG) analysis. However, such black box models are inadequate for analysing physiological signals where credibility and interpretability are crucial. Indeed, how to make ECG analysis transparent is still an open problem. In this study, we develop a Tsetlin machine (TM) based architecture for premature ventricular contraction (PVC) identification by analysing long-term ECG signals. The architecture is transparent by describing patterns directly with logical AND rules. To validate the accuracy of our approach, we compare the TM performance with those of convolutional neural networks (CNNs). Our numerical results demonstrate that TM provides comparable performance with CNNs on the MIT-BIH database. To validate interpretability, we provide explanatory diagrams that show how TM makes the PVC identification from confirming and invalidating patterns. We argue that these are compatible with medical knowledge so that they can be readily understood and verified by a medical doctor. Accordingly, we believe this study paves the way for machine learning (ML) for ECG analysis in clinical practice.
    Inference of Continuous Linear Systems from Data with Guaranteed Stability. (arXiv:2301.10060v1 [cs.LG])
    Machine-learning technologies for learning dynamical systems from data play an important role in engineering design. This research focuses on learning continuous linear models from data. Stability, a key feature of dynamic systems, is especially important in design tasks such as prediction and control. Thus, there is a need to develop methodologies that provide stability guarantees. To that end, we leverage the parameterization of stable matrices proposed in [Gillis/Sharma, Automatica, 2017] to realize the desired models. Furthermore, to avoid the estimation of derivative information to learn continuous systems, we formulate the inference problem in an integral form. We also discuss a few extensions, including those related to control systems. Numerical experiments show that the combination of a stable matrix parameterization and an integral form of differential equations allows us to learn stable systems without requiring derivative information, which can be challenging to obtain in situations with noisy or limited data.
    Minimal Value-Equivalent Partial Models for Scalable and Robust Planning in Lifelong Reinforcement Learning. (arXiv:2301.10119v1 [cs.LG])
    Learning models of the environment from pure interaction is often considered an essential component of building lifelong reinforcement learning agents. However, the common practice in model-based reinforcement learning is to learn models that model every aspect of the agent's environment, regardless of whether they are important in coming up with optimal decisions or not. In this paper, we argue that such models are not particularly well-suited for performing scalable and robust planning in lifelong reinforcement learning scenarios and we propose new kinds of models that only model the relevant aspects of the environment, which we call "minimal value-equivalent partial models". After providing a formal definition for these models, we provide theoretical results demonstrating the scalability advantages of performing planning with such models and then perform experiments to empirically illustrate our theoretical results. Then, we provide some useful heuristics on how to learn these kinds of models with deep learning architectures and empirically demonstrate that models learned in such a way can allow for performing planning that is robust to distribution shifts and compounding model errors. Overall, both our theoretical and empirical results suggest that minimal value-equivalent partial models can provide significant benefits to performing scalable and robust planning in lifelong reinforcement learning scenarios.  ( 2 min )
    Autonomous particles. (arXiv:2301.10077v1 [cs.LG])
    Consider a reinforcement learning problem where an agent has access to a very large amount of information about the environment, but it can only take very few actions to accomplish its task and to maximize its reward. Evidently, the main problem for the agent is to learn a map from a very high-dimensional space (which represents its environment) to a very low-dimensional space (which represents its actions). The high-to-low dimensional map implies that most of the information about the environment is irrelevant for the actions to be taken, and only a small fraction of information is relevant. In this paper we argue that the relevant information need not be learned by brute force (which is the standard approach), but can be identified from the intrinsic symmetries of the system. We analyze in details a reinforcement learning problem of autonomous driving, where the corresponding symmetry is the Galilean symmetry, and argue that the learning task can be accomplished with very few relevant parameters, or, more precisely, invariants. For a numerical demonstration, we show that the autonomous vehicles (which we call autonomous particles since they describe very primitive vehicles) need only four relevant invariants to learn how to drive very well without colliding with other particles. The simple model can be easily generalized to include different types of particles (e.g. for cars, for pedestrians, for buildings, for road signs, etc.) with different types of relevant invariants describing interactions between them. We also argue that there must exist a field theory description of the learning system where autonomous particles would be described by fermionic degrees of freedom and interactions mediated by the relevant invariants would be described by bosonic degrees of freedom.  ( 2 min )
    Differentiable bit-rate estimation for neural-based video codec enhancement. (arXiv:2301.09776v1 [eess.IV])
    Neural networks (NN) can improve standard video compression by pre- and post-processing the encoded video. For optimal NN training, the standard codec needs to be replaced with a codec proxy that can provide derivatives of estimated bit-rate and distortion, which are used for gradient back-propagation. Since entropy coding of standard codecs is designed to take into account non-linear dependencies between transform coefficients, bit-rates cannot be well approximated with simple per-coefficient estimators. This paper presents a new approach for bit-rate estimation that is similar to the type employed in training end-to-end neural codecs, and able to efficiently take into account those statistical dependencies. It is defined from a mathematical model that provides closed-form formulas for the estimates and their gradients, reducing the computational complexity. Experimental results demonstrate the method's accuracy in estimating HEVC/H.265 codec bit-rates.  ( 2 min )
    Explainable Data-Driven Optimization: From Context to Decision and Back Again. (arXiv:2301.10074v1 [cs.LG])
    Data-driven optimization uses contextual information and machine learning algorithms to find solutions to decision problems with uncertain parameters. While a vast body of work is dedicated to interpreting machine learning models in the classification setting, explaining decision pipelines involving learning algorithms remains unaddressed. This lack of interpretability can block the adoption of data-driven solutions as practitioners may not understand or trust the recommended decisions. We bridge this gap by introducing a counterfactual explanation methodology tailored to explain solutions to data-driven problems. We introduce two classes of explanations and develop methods to find nearest explanations of random forest and nearest-neighbor predictors. We demonstrate our approach by explaining key problems in operations management such as inventory management and routing.  ( 2 min )
    PolarAir: A Compressed Sensing Scheme for Over-the-Air Federated Learning. (arXiv:2301.10110v1 [cs.IT])
    We explore a scheme that enables the training of a deep neural network in a Federated Learning configuration over an additive white Gaussian noise channel. The goal is to create a low complexity, linear compression strategy, called PolarAir, that reduces the size of the gradient at the user side to lower the number of channel uses needed to transmit it. The suggested approach belongs to the family of compressed sensing techniques, yet it constructs the sensing matrix and the recovery procedure using multiple access techniques. Simulations show that it can reduce the number of channel uses by ~30% when compared to conveying the gradient without compression. The main advantage of the proposed scheme over other schemes in the literature is its low time complexity. We also investigate the behavior of gradient updates and the performance of PolarAir throughout the training process to obtain insight on how best to construct this compression scheme based on compressed sensing.  ( 2 min )
    Intrinsic Motivation in Model-based Reinforcement Learning: A Brief Review. (arXiv:2301.10067v1 [cs.LG])
    The reinforcement learning research area contains a wide range of methods for solving the problems of intelligent agent control. Despite the progress that has been made, the task of creating a highly autonomous agent is still a significant challenge. One potential solution to this problem is intrinsic motivation, a concept derived from developmental psychology. This review considers the existing methods for determining intrinsic motivation based on the world model obtained by the agent. We propose a systematic approach to current research in this field, which consists of three categories of methods, distinguished by the way they utilize a world model in the agent's components: complementary intrinsic reward, exploration policy, and intrinsically motivated goals. The proposed unified framework describes the architecture of agents using a world model and intrinsic motivation to improve learning. The potential for developing new techniques in this area of research is also examined.  ( 2 min )
    A Linear Reconstruction Approach for Attribute Inference Attacks against Synthetic Data. (arXiv:2301.10053v1 [cs.LG])
    Personal data collected at scale from surveys or digital devices offers important insights for statistical analysis and scientific research. Safely sharing such data while protecting privacy is however challenging. Anonymization allows data to be shared while minimizing privacy risks, but traditional anonymization techniques have been repeatedly shown to provide limited protection against re-identification attacks in practice. Among modern anonymization techniques, synthetic data generation (SDG) has emerged as a potential solution to find a good tradeoff between privacy and statistical utility. Synthetic data is typically generated using algorithms that learn the statistical distribution of the original records, to then generate "artificial" records that are structurally and statistically similar to the original ones. Yet, the fact that synthetic records are "artificial" does not, per se, guarantee that privacy is protected. In this work, we systematically evaluate the tradeoffs between protecting privacy and preserving statistical utility for a wide range of synthetic data generation algorithms. Modeling privacy as protection against attribute inference attacks (AIAs), we extend and adapt linear reconstruction attacks, which have not been previously studied in the context of synthetic data. While prior work suggests that AIAs may be effective only on few outlier records, we show they can be very effective even on randomly selected records. We evaluate attacks on synthetic datasets ranging from 10^3 to 10^6 records, showing that even for the same generative model, the attack effectiveness can drastically increase when a larger number of synthetic records is generated. Overall, our findings prove that synthetic data is subject to privacy-utility tradeoffs just like other anonymization techniques: when good utility is preserved, attribute inference can be a risk for many data subjects.  ( 2 min )
    SMART: Self-supervised Multi-task pretrAining with contRol Transformers. (arXiv:2301.09816v1 [cs.LG])
    Self-supervised pretraining has been extensively studied in language and vision domains, where a unified model can be easily adapted to various downstream tasks by pretraining representations without explicit labels. When it comes to sequential decision-making tasks, however, it is difficult to properly design such a pretraining approach that can cope with both high-dimensional perceptual information and the complexity of sequential control over long interaction horizons. The challenge becomes combinatorially more complex if we want to pretrain representations amenable to a large variety of tasks. To tackle this problem, in this work, we formulate a general pretraining-finetuning pipeline for sequential decision making, under which we propose a generic pretraining framework \textit{Self-supervised Multi-task pretrAining with contRol Transformer (SMART)}. By systematically investigating pretraining regimes, we carefully design a Control Transformer (CT) coupled with a novel control-centric pretraining objective in a self-supervised manner. SMART encourages the representation to capture the common essential information relevant to short-term control and long-term control, which is transferrable across tasks. We show by extensive experiments in DeepMind Control Suite that SMART significantly improves the learning efficiency among seen and unseen downstream tasks and domains under different learning scenarios including Imitation Learning (IL) and Reinforcement Learning (RL). Benefiting from the proposed control-centric objective, SMART is resilient to distribution shift between pretraining and finetuning, and even works well with low-quality pretraining datasets that are randomly collected.  ( 2 min )
    Koopman neural operator as a mesh-free solver of non-linear partial differential equations. (arXiv:2301.10022v1 [cs.LG])
    The lacking of analytic solutions of diverse partial differential equations (PDEs) gives birth to series of computational techniques for numerical solutions. In machine learning, numerous latest advances of solver designs are accomplished in developing neural operators, a kind of mesh-free approximators of the infinite-dimensional operators that map between different parameterization spaces of equation solutions. Although neural operators exhibit generalization capacities for learning an entire PDE family simultaneously, they become less accurate and explainable while learning long-term behaviours of non-linear PDE families. In this paper, we propose Koopman neural operator (KNO), a new neural operator, to overcome these challenges. With the same objective of learning an infinite-dimensional mapping between Banach spaces that serves as the solution operator of target PDE family, our approach differs from existing models by formulating a non-linear dynamic system of equation solution. By approximating the Koopman operator, an infinite-dimensional linear operator governing all possible observations of the dynamic system, to act on the flow mapping of dynamic system, we can equivalently learn the solution of an entire non-linear PDE family by solving simple linear prediction problems. In zero-shot prediction and long-term prediction experiments on representative PDEs (e.g., the Navier-Stokes equation), KNO exhibits notable advantages in breaking the tradeoff between accuracy and efficiency (e.g., model size) while previous state-of-the-art models are limited. These results suggest that more efficient PDE solvers can be developed by the joint efforts from physics and machine learning.  ( 2 min )
    Robust Fair Clustering: A Novel Fairness Attack and Defense Framework. (arXiv:2210.01953v2 [cs.LG] UPDATED)
    Clustering algorithms are widely used in many societal resource allocation applications, such as loan approvals and candidate recruitment, among others, and hence, biased or unfair model outputs can adversely impact individuals that rely on these applications. To this end, many fair clustering approaches have been recently proposed to counteract this issue. Due to the potential for significant harm, it is essential to ensure that fair clustering algorithms provide consistently fair outputs even under adversarial influence. However, fair clustering algorithms have not been studied from an adversarial attack perspective. In contrast to previous research, we seek to bridge this gap and conduct a robustness analysis against fair clustering by proposing a novel black-box fairness attack. Through comprehensive experiments, we find that state-of-the-art models are highly susceptible to our attack as it can reduce their fairness performance significantly. Finally, we propose Consensus Fair Clustering (CFC), the first robust fair clustering approach that transforms consensus clustering into a fair graph partitioning problem, and iteratively learns to generate fair cluster outputs. Experimentally, we observe that CFC is highly robust to the proposed attack and is thus a truly robust fair clustering alternative.  ( 2 min )
    3D UX-Net: A Large Kernel Volumetric ConvNet Modernizing Hierarchical Transformer for Medical Image Segmentation. (arXiv:2209.15076v3 [cs.CV] UPDATED)
    The recent 3D medical ViTs (e.g., SwinUNETR) achieve the state-of-the-art performances on several 3D volumetric data benchmarks, including 3D medical image segmentation. Hierarchical transformers (e.g., Swin Transformers) reintroduced several ConvNet priors and further enhanced the practical viability of adapting volumetric segmentation in 3D medical datasets. The effectiveness of hybrid approaches is largely credited to the large receptive field for non-local self-attention and the large number of model parameters. In this work, we propose a lightweight volumetric ConvNet, termed 3D UX-Net, which adapts the hierarchical transformer using ConvNet modules for robust volumetric segmentation. Specifically, we revisit volumetric depth-wise convolutions with large kernel size (e.g. starting from $7\times7\times7$) to enable the larger global receptive fields, inspired by Swin Transformer. We further substitute the multi-layer perceptron (MLP) in Swin Transformer blocks with pointwise depth convolutions and enhance model performances with fewer normalization and activation layers, thus reducing the number of model parameters. 3D UX-Net competes favorably with current SOTA transformers (e.g. SwinUNETR) using three challenging public datasets on volumetric brain and abdominal imaging: 1) MICCAI Challenge 2021 FLARE, 2) MICCAI Challenge 2021 FeTA, and 3) MICCAI Challenge 2022 AMOS. 3D UX-Net consistently outperforms SwinUNETR with improvement from 0.929 to 0.938 Dice (FLARE2021) and 0.867 to 0.874 Dice (Feta2021). We further evaluate the transfer learning capability of 3D UX-Net with AMOS2022 and demonstrates another improvement of $2.27\%$ Dice (from 0.880 to 0.900). The source code with our proposed model are available at https://github.com/MASILab/3DUX-Net.  ( 2 min )
    Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression. (arXiv:2301.09830v1 [cs.LG])
    In training of modern large natural language processing (NLP) models, it has become a common practice to split models using 3D parallelism to multiple GPUs. Such technique, however, suffers from a high overhead of inter-node communication. Compressing the communication is one way to mitigate the overhead by reducing the inter-node traffic volume; however, the existing compression techniques have critical limitations to be applied for NLP models with 3D parallelism in that 1) only the data parallelism traffic is targeted, and 2) the existing compression schemes already harm the model quality too much. In this paper, we present Optimus-CC, a fast and scalable distributed training framework for large NLP models with aggressive communication compression. Optimus-CC differs from existing communication compression frameworks in the following ways: First, we compress pipeline parallel (inter-stage) traffic. In specific, we compress the inter-stage backpropagation and the embedding synchronization in addition to the existing data-parallel traffic compression methods. Second, we propose techniques to avoid the model quality drop that comes from the compression. We further provide mathematical and empirical analyses to show that our techniques can successfully suppress the compression error. Lastly, we analyze the pipeline and opt to selectively compress those traffic lying on the critical path. This further helps reduce the compression error. We demonstrate our solution on a GPU cluster, and achieve superior speedup from the baseline state-of-the-art solutions for distributed training without sacrificing the model quality.  ( 2 min )
    When does the student surpass the teacher? Federated Semi-supervised Learning with Teacher-Student EMA. (arXiv:2301.10114v1 [cs.LG])
    Semi-Supervised Learning (SSL) has received extensive attention in the domain of computer vision, leading to development of promising approaches such as FixMatch. In scenarios where training data is decentralized and resides on client devices, SSL must be integrated with privacy-aware training techniques such as Federated Learning. We consider the problem of federated image classification and study the performance and privacy challenges with existing federated SSL (FSSL) approaches. Firstly, we note that even state-of-the-art FSSL algorithms can trivially compromise client privacy and other real-world constraints such as client statelessness and communication cost. Secondly, we observe that it is challenging to integrate EMA (Exponential Moving Average) updates into the federated setting, which comes at a trade-off between performance and communication cost. We propose a novel approach FedSwitch, that improves privacy as well as generalization performance through Exponential Moving Average (EMA) updates. FedSwitch utilizes a federated semi-supervised teacher-student EMA framework with two features - local teacher adaptation and adaptive switching between teacher and student for pseudo-label generation. Our proposed approach outperforms the state-of-the-art on federated image classification, can be adapted to real-world constraints, and achieves good generalization performance with minimal communication cost overhead.  ( 2 min )
    Proceedings of the 1st International Workshop on Reading Music Systems. (arXiv:2301.10062v1 [cs.CV])
    The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 1st International Workshop on Reading Music Systems, held in Paris on the 20th of September 2018.  ( 2 min )
    Quantum Heavy-tailed Bandits. (arXiv:2301.09680v1 [cs.LG])
    In this paper, we study multi-armed bandits (MAB) and stochastic linear bandits (SLB) with heavy-tailed rewards and quantum reward oracle. Unlike the previous work on quantum bandits that assumes bounded/sub-Gaussian distributions for rewards, here we investigate the quantum bandits problem under a weaker assumption that the distributions of rewards only have bounded $(1+v)$-th moment for some $v\in (0,1]$. In order to achieve regret improvements for heavy-tailed bandits, we first propose a new quantum mean estimator for heavy-tailed distributions, which is based on the Quantum Monte Carlo Mean Estimator and achieves a quadratic improvement of estimation error compared to the classical one. Based on our quantum mean estimator, we focus on quantum heavy-tailed MAB and SLB and propose quantum algorithms based on the Upper Confidence Bound (UCB) framework for both problems with $\Tilde{O}(T^{\frac{1-v}{1+v}})$ regrets, polynomially improving the dependence in terms of $T$ as compared to classical (near) optimal regrets of $\Tilde{O}(T^{\frac{1}{1+v}})$, where $T$ is the number of rounds. Finally, experiments also support our theoretical results and show the effectiveness of our proposed methods.  ( 2 min )
    From Robots to Books: An Introduction to Smart Applications of AI in Education (AIEd). (arXiv:2301.10026v1 [cs.CY])
    The world around us has undergone a radical transformation due to rapid technological advancement in recent decades. The industry of the future generation is evolving, and artificial intelligence is the following change in the making popularly known as Industry 4.0. Indeed, experts predict that artificial intelligence(AI) will be the main force behind the following significant virtual shift in the way we stay, converse, study, live, communicate and conduct business. All facets of our social connection are being transformed by this growing technology. One of the newest areas of educational technology is Artificial Intelligence in the field of Education(AIEd).This study emphasizes the different applications of artificial intelligence in education from both an industrial and academic standpoint. It highlights the most recent contextualized learning novel transformative evaluations and advancements in sophisticated tutoring systems. It analyses the AIEd's ethical component and the influence of the transition on people, particularly students and instructors as well. Finally, this article touches on AIEd's potential future research and practices. The goal of this study is to introduce the present-day applications to its intended audience.  ( 2 min )
    FedPrompt: Communication-Efficient and Privacy Preserving Prompt Tuning in Federated Learning. (arXiv:2208.12268v3 [cs.LG] UPDATED)
    Federated learning (FL) has enabled global model training on decentralized data in a privacy-preserving way by aggregating model updates. However, for many natural language processing (NLP) tasks that utilize pre-trained language models (PLMs) with large numbers of parameters, there are considerable communication costs associated with FL. Recently, prompt tuning, which tunes some soft prompts without modifying PLMs, has achieved excellent performance as a new learning paradigm. Therefore we want to combine the two methods and explore the effect of prompt tuning under FL. In this paper, we propose "FedPrompt" to study prompt tuning in a model split aggregation way using FL, and prove that split aggregation greatly reduces the communication cost, only 0.01% of the PLMs' parameters, with little decrease on accuracy both on IID and Non-IID data distribution. This improves the efficiency of FL method while also protecting the data privacy in prompt tuning. In addition, like PLMs, prompts are uploaded and downloaded between public platforms and personal users, so we try to figure out whether there is still a backdoor threat using only soft prompts in FL scenarios. We further conduct backdoor attacks by data poisoning on FedPrompt. Our experiments show that normal backdoor attack can not achieve a high attack success rate, proving the robustness of FedPrompt. We hope this work can promote the application of prompt in FL and raise the awareness of the possible security threats.  ( 2 min )
    Efficient learning of large sets of locally optimal classification rules. (arXiv:2301.09936v1 [cs.LG])
    Conventional rule learning algorithms aim at finding a set of simple rules, where each rule covers as many examples as possible. In this paper, we argue that the rules found in this way may not be the optimal explanations for each of the examples they cover. Instead, we propose an efficient algorithm that aims at finding the best rule covering each training example in a greedy optimization consisting of one specialization and one generalization loop. These locally optimal rules are collected and then filtered for a final rule set, which is much larger than the sets learned by conventional rule learning algorithms. A new example is classified by selecting the best among the rules that cover this example. In our experiments on small to very large datasets, the approach's average classification accuracy is higher than that of state-of-the-art rule learning algorithms. Moreover, the algorithm is highly efficient and can inherently be processed in parallel without affecting the learned rule set and so the classification accuracy. We thus believe that it closes an important gap for large-scale classification rule induction.  ( 2 min )
    Model Agnostic Sample Reweighting for Out-of-Distribution Learning. (arXiv:2301.09819v1 [cs.LG])
    Distributionally robust optimization (DRO) and invariant risk minimization (IRM) are two popular methods proposed to improve out-of-distribution (OOD) generalization performance of machine learning models. While effective for small models, it has been observed that these methods can be vulnerable to overfitting with large overparameterized models. This work proposes a principled method, \textbf{M}odel \textbf{A}gnostic sam\textbf{PL}e r\textbf{E}weighting (\textbf{MAPLE}), to effectively address OOD problem, especially in overparameterized scenarios. Our key idea is to find an effective reweighting of the training samples so that the standard empirical risk minimization training of a large model on the weighted training data leads to superior OOD generalization performance. The overfitting issue is addressed by considering a bilevel formulation to search for the sample reweighting, in which the generalization complexity depends on the search space of sample weights instead of the model size. We present theoretical analysis in linear case to prove the insensitivity of MAPLE to model size, and empirically verify its superiority in surpassing state-of-the-art methods by a large margin. Code is available at \url{https://github.com/x-zho14/MAPLE}.  ( 2 min )
    Spectral Cross-Domain Neural Network with Soft-adaptive Threshold Spectral Enhancement. (arXiv:2301.10171v1 [cs.LG])
    Electrocardiography (ECG) signals can be considered as multi-variable time-series. The state-of-the-art ECG data classification approaches, based on either feature engineering or deep learning techniques, treat separately spectral and time domains in machine learning systems. No spectral-time domain communication mechanism inside the classifier model can be found in current approaches, leading to difficulties in identifying complex ECG forms. In this paper, we proposed a novel deep learning model named Spectral Cross-domain neural network (SCDNN) with a new block called Soft-adaptive threshold spectral enhancement (SATSE), to simultaneously reveal the key information embedded in spectral and time domains inside the neural network. More precisely, the domain-cross information is captured by a general Convolutional neural network (CNN) backbone, and different information sources are merged by a self-adaptive mechanism to mine the connection between time and spectral domains. In SATSE, the knowledge from time and spectral domains is extracted via the Fast Fourier Transformation (FFT) with soft trainable thresholds in modified Sigmoid functions. The proposed SCDNN is tested with several classification tasks implemented on the public ECG databases \textit{PTB-XL} and \textit{MIT-BIH}. SCDNN outperforms the state-of-the-art approaches with a low computational cost regarding a variety of metrics in all classification tasks on both databases, by finding appropriate domains from the infinite spectral mapping. The convergence of the trainable thresholds in the spectral domain is also numerically investigated in this paper. The robust performance of SCDNN provides a new perspective to exploit knowledge across deep learning models from time and spectral domains. The repository can be found: https://github.com/DL-WG/SCDNN-TS  ( 2 min )
    Multi-view Kernel PCA for Time series Forecasting. (arXiv:2301.09811v1 [cs.LG])
    In this paper, we propose a kernel principal component analysis model for multi-variate time series forecasting, where the training and prediction schemes are derived from the multi-view formulation of Restricted Kernel Machines. The training problem is simply an eigenvalue decomposition of the summation of two kernel matrices corresponding to the views of the input and output data. When a linear kernel is used for the output view, it is shown that the forecasting equation takes the form of kernel ridge regression. When that kernel is non-linear, a pre-image problem has to be solved to forecast a point in the input space. We evaluate the model on several standard time series datasets, perform ablation studies, benchmark with closely related models and discuss its results.  ( 2 min )
    Quantification of Damage Using Indirect Structural Health Monitoring. (arXiv:2301.09791v1 [cs.LG])
    Structural health monitoring is important to make sure bridges do not fail. Since direct monitoring can be complicated and expensive, indirect methods have been a focus on research. Indirect monitoring can be much cheaper and easier to conduct, however there are challenges with getting accurate results. This work focuses on damage quantification by using accelerometers. Tests were conducted on a model bridge and car with four accelerometers attached to to the vehicle. Different weights were placed on the bridge to simulate different levels of damage, and 31 tests were run for 20 different damage levels. The acceleration data collected was normalized and a Fast-Fourier Transform (FFT) was performed on that data. Both the normalized acceleration data and the normalized FFT data were inputted into a Non-Linear Principal Component Analysis (separately) and three principal components were extracted for each data set. Support Vector Regression (SVR) and Gaussian Process Regression (GPR) were used as the supervised machine learning methods to develop models. Multiple models were created so that the best one could be selected, and the models were compared by looking at their Mean Squared Errors (MSE). This methodology should be applied in the field to measure how effective it can be in real world applications.  ( 2 min )
    Feature-based Image Matching for Identifying Individual K\=ak\=a. (arXiv:2301.06678v2 [cs.CV] UPDATED)
    This report investigates an unsupervised, feature-based image matching pipeline for the novel application of identifying individual k\=ak\=a. Applied with a similarity network for clustering, this addresses a weakness of current supervised approaches to identifying individual birds which struggle to handle the introduction of new individuals to the population. Our approach uses object localisation to locate k\=ak\=a within images and then extracts local features that are invariant to rotation and scale. These features are matched between images with nearest neighbour matching techniques and mismatch removal to produce a similarity score for image match comparison. The results show that matches obtained via the image matching pipeline achieve high accuracy of true matches. We conclude that feature-based image matching could be used with a similarity network to provide a viable alternative to existing supervised approaches.  ( 2 min )
    Accurate Detection of Paroxysmal Atrial Fibrillation with Certified-GAN and Neural Architecture Search. (arXiv:2301.10173v1 [cs.LG])
    This paper presents a novel machine learning framework for detecting Paroxysmal Atrial Fibrillation (PxAF), a pathological characteristic of Electrocardiogram (ECG) that can lead to fatal conditions such as heart attack. To enhance the learning process, the framework involves a Generative Adversarial Network (GAN) along with a Neural Architecture Search (NAS) in the data preparation and classifier optimization phases. The GAN is innovatively invoked to overcome the class imbalance of the training data by producing the synthetic ECG for PxAF class in a certified manner. The effect of the certified GAN is statistically validated. Instead of using a general-purpose classifier, the NAS automatically designs a highly accurate convolutional neural network architecture customized for the PxAF classification task. Experimental results show that the accuracy of the proposed framework exhibits a high value of 99% which not only enhances state-of-the-art by up to 5.1%, but also improves the classification performance of the two widely-accepted baseline methods, ResNet-18, and Auto-Sklearn, by 2.2% and 6.1%.  ( 2 min )
    Dataset Bias in Human Activity Recognition. (arXiv:2301.10161v1 [eess.SP])
    When creating multi-channel time-series datasets for Human Activity Recognition (HAR), researchers are faced with the issue of subject selection criteria. It is unknown what physical characteristics and/or soft-biometrics, such as age, height, and weight, need to be taken into account to train a classifier to achieve robustness towards heterogeneous populations in the training and testing data. This contribution statistically curates the training data to assess to what degree the physical characteristics of humans influence HAR performance. We evaluate the performance of a state-of-the-art convolutional neural network on two HAR datasets that vary in the sensors, activities, and recording for time-series HAR. The training data is intentionally biased with respect to human characteristics to determine the features that impact motion behaviour. The evaluations brought forth the impact of the subjects' characteristics on HAR. Thus, providing insights regarding the robustness of the classifier with respect to heterogeneous populations. The study is a step forward in the direction of fair and trustworthy artificial intelligence by attempting to quantify representation bias in multi-channel time series HAR data.  ( 2 min )
    Domain generalization in deep learning-based mass detection in mammography: A large-scale multi-center study. (arXiv:2201.11620v2 [eess.IV] CROSS LISTED)
    Computer-aided detection systems based on deep learning have shown great potential in breast cancer detection. However, the lack of domain generalization of artificial neural networks is an important obstacle to their deployment in changing clinical environments. In this work, we explore the domain generalization of deep learning methods for mass detection in digital mammography and analyze in-depth the sources of domain shift in a large-scale multi-center setting. To this end, we compare the performance of eight state-of-the-art detection methods, including Transformer-based models, trained in a single domain and tested in five unseen domains. Moreover, a single-source mass detection training pipeline is designed to improve the domain generalization without requiring images from the new domain. The results show that our workflow generalizes better than state-of-the-art transfer learning-based approaches in four out of five domains while reducing the domain shift caused by the different acquisition protocols and scanner manufacturers. Subsequently, an extensive analysis is performed to identify the covariate shifts with bigger effects on the detection performance, such as due to differences in patient age, breast density, mass size, and mass malignancy. Ultimately, this comprehensive study provides key insights and best practices for future research on domain generalization in deep learning-based breast cancer detection.  ( 2 min )
    On Dynamic Regret and Constraint Violations in Constrained Online Convex Optimization. (arXiv:2301.09808v1 [cs.LG])
    A constrained version of the online convex optimization (OCO) problem is considered. With slotted time, for each slot, first an action is chosen. Subsequently the loss function and the constraint violation penalty evaluated at the chosen action point is revealed. For each slot, both the loss function as well as the function defining the constraint set is assumed to be smooth and strongly convex. In addition, once an action is chosen, local information about a feasible set within a small neighborhood of the current action is also revealed. An algorithm is allowed to compute at most one gradient at its point of choice given the described feedback to choose the next action. The goal of an algorithm is to simultaneously minimize the dynamic regret (loss incurred compared to the oracle's loss) and the constraint violation penalty (penalty accrued compared to the oracle's penalty). We propose an algorithm that follows projected gradient descent over a suitably chosen set around the current action. We show that both the dynamic regret and the constraint violation is order-wise bounded by the {\it path-length}, the sum of the distances between the consecutive optimal actions. Moreover, we show that the derived bounds are the best possible.
    LDMIC: Learning-based Distributed Multi-view Image Coding. (arXiv:2301.09799v1 [eess.IV])
    Multi-view image compression plays a critical role in 3D-related applications. Existing methods adopt a predictive coding architecture, which requires joint encoding to compress the corresponding disparity as well as residual information. This demands collaboration among cameras and enforces the epipolar geometric constraint between different views, which makes it challenging to deploy these methods in distributed camera systems with randomly overlapping fields of view. Meanwhile, distributed source coding theory indicates that efficient data compression of correlated sources can be achieved by independent encoding and joint decoding, which motivates us to design a learning-based distributed multi-view image coding (LDMIC) framework. With independent encoders, LDMIC introduces a simple yet effective joint context transfer module based on the cross-attention mechanism at the decoder to effectively capture the global inter-view correlations, which is insensitive to the geometric relationships between images. Experimental results show that LDMIC significantly outperforms both traditional and learning-based MIC methods while enjoying fast encoding speed. Code will be released at https://github.com/Xinjie-Q/LDMIC.  ( 2 min )
    Explainable Deep Reinforcement Learning: State of the Art and Challenges. (arXiv:2301.09937v1 [cs.LG])
    Interpretability, explainability and transparency are key issues to introducing Artificial Intelligence methods in many critical domains: This is important due to ethical concerns and trust issues strongly connected to reliability, robustness, auditability and fairness, and has important consequences towards keeping the human in the loop in high levels of automation, especially in critical cases for decision making, where both (human and the machine) play important roles. While the research community has given much attention to explainability of closed (or black) prediction boxes, there are tremendous needs for explainability of closed-box methods that support agents to act autonomously in the real world. Reinforcement learning methods, and especially their deep versions, are such closed-box methods. In this article we aim to provide a review of state of the art methods for explainable deep reinforcement learning methods, taking also into account the needs of human operators - i.e., of those that take the actual and critical decisions in solving real-world problems. We provide a formal specification of the deep reinforcement learning explainability problems, and we identify the necessary components of a general explainable reinforcement learning framework. Based on these, we provide a comprehensive review of state of the art methods, categorizing them in classes according to the paradigm they follow, the interpretable models they use, and the surface representation of explanations provided. The article concludes identifying open questions and important challenges.  ( 2 min )
    Predicting Socio-Economic Well-being Using Mobile Apps Data: A Case Study of France. (arXiv:2301.09986v1 [cs.CY])
    Socio-economic indicators provide context for assessing a country's overall condition. These indicators contain information about education, gender, poverty, employment, and other factors. Therefore, reliable and accurate information is critical for social research and government policing. Most data sources available today, such as censuses, have sparse population coverage or are updated infrequently. Nonetheless, alternative data sources, such as call data records (CDR) and mobile app usage, can serve as cost-effective and up-to-date sources for identifying socio-economic indicators. This work investigates mobile app data to predict socio-economic features. We present a large-scale study using data that captures the traffic of thousands of mobile applications by approximately 30 million users distributed over 550,000 km square and served by over 25,000 base stations. The dataset covers the whole France territory and spans more than 2.5 months, starting from 16th March 2019 to 6th June 2019. Using the app usage patterns, our best model can estimate socio-economic indicators (attaining an R-squared score upto 0.66). Furthermore, using models' explainability, we discover that mobile app usage patterns have the potential to reveal socio-economic disparities in IRIS. Insights of this study provide several avenues for future interventions, including users' temporal network analysis and exploration of alternative data sources.  ( 2 min )
    Investigating Labeler Bias in Face Annotation for Machine Learning. (arXiv:2301.09902v1 [cs.LG])
    In a world increasingly reliant on artificial intelligence, it is more important than ever to consider the ethical implications of artificial intelligence on humanity. One key under-explored challenge is labeler bias, which can create inherently biased datasets for training and subsequently lead to inaccurate or unfair decisions in healthcare, employment, education, and law enforcement. Hence, we conducted a study to investigate and measure the existence of labeler bias using images of people from different ethnicities and sexes in a labeling task. Our results show that participants possess stereotypes that influence their decision-making process and that labeler demographics impact assigned labels. We also discuss how labeler bias influences datasets and, subsequently, the models trained on them. Overall, a high degree of transparency must be maintained throughout the entire artificial intelligence training process to identify and correct biases in the data as early as possible.  ( 2 min )
    Membership Inference of Diffusion Models. (arXiv:2301.09956v1 [cs.CR])
    Recent years have witnessed the tremendous success of diffusion models in data synthesis. However, when diffusion models are applied to sensitive data, they also give rise to severe privacy concerns. In this paper, we systematically present the first study about membership inference attacks against diffusion models, which aims to infer whether a sample was used to train the model. Two attack methods are proposed, namely loss-based and likelihood-based attacks. Our attack methods are evaluated on several state-of-the-art diffusion models, over different datasets in relation to privacy-sensitive data. Extensive experimental evaluations show that our attacks can achieve remarkable performance. Furthermore, we exhaustively investigate various factors which can affect attack performance. Finally, we also evaluate the performance of our attack methods on diffusion models trained with differential privacy.  ( 2 min )
    Solving the Discretised Neutron Diffusion Equations using Neural Networks: Applications in neutron transport. (arXiv:2301.09991v1 [cs.CE])
    In this paper we solve the Boltzmann transport equation using AI libraries. The reason why this is attractive is because it enables one to use the highly optimised software within AI libraries, enabling one to run on different computer architectures and enables one to tap into the vast quantity of community based software that has been developed for AI and ML applications e.g. mixed arithmetic precision or model parallelism. Here we take the first steps towards developing this approach for the Boltzmann transport equation and develop the necessary methods in order to do that effectively. This includes: 1) A space-angle multigrid solution method that can extract the level of parallelism necessary to run efficiently on GPUs or new AI computers. 2) A new Convolutional Finite Element Method (ConvFEM) that greatly simplifies the implementation of high order finite elements (quadratic to quintic, say). 3) A new non-linear Petrov-Galerkin method that introduces dissipation anisotropically.  ( 2 min )
    Fair and skill-diverse student group formation via constrained k-way graph partitioning. (arXiv:2301.09984v1 [cs.LG])
    Forming the right combination of students in a group promises to enable a powerful and effective environment for learning and collaboration. However, defining a group of students is a complex task which has to satisfy multiple constraints. This work introduces an unsupervised algorithm for fair and skill-diverse student group formation. This is achieved by taking account of student course marks and sensitive attributes provided by the education office. The skill sets of students are determined using unsupervised dimensionality reduction of course mark data via the Laplacian eigenmap. The problem is formulated as a constrained graph partitioning problem, whereby the diversity of skill sets in each group are maximised, group sizes are upper and lower bounded according to available resources, and `balance' of a sensitive attribute is lower bounded to enforce fairness in group formation. This optimisation problem is solved using integer programming and its effectiveness is demonstrated on a dataset of student course marks from Imperial College London.  ( 2 min )
    The Backpropagation algorithm for a math student. (arXiv:2301.09977v1 [cs.LG])
    A Deep Neural Network (DNN) is a composite function of vector-valued functions, and in order to train a DNN, it is necessary to calculate the gradient of the loss function with respect to all parameters. This calculation can be a non-trivial task because the loss function of a DNN is a composition of several nonlinear functions, each with numerous parameters. The Backpropagation (BP) algorithm leverages the composite structure of the DNN to efficiently compute the gradient. As a result, the number of layers in the network does not significantly impact the complexity of the calculation. The objective of this paper is to express the gradient of the loss function in terms of a matrix multiplication using the Jacobian operator. This can be achieved by considering the total derivative of each layer with respect to its parameters and expressing it as a Jacobian matrix. The gradient can then be represented as the matrix product of these Jacobian matrices. This approach is valid because the chain rule can be applied to a composition of vector-valued functions, and the use of Jacobian matrices allows for the incorporation of multiple inputs and outputs. By providing concise mathematical justifications, the results can be made understandable and useful to a broad audience from various disciplines.  ( 2 min )
    Probabilistic Bilevel Coreset Selection. (arXiv:2301.09880v1 [cs.LG])
    The goal of coreset selection in supervised learning is to produce a weighted subset of data, so that training only on the subset achieves similar performance as training on the entire dataset. Existing methods achieved promising results in resource-constrained scenarios such as continual learning and streaming. However, most of the existing algorithms are limited to traditional machine learning models. A few algorithms that can handle large models adopt greedy search approaches due to the difficulty in solving the discrete subset selection problem, which is computationally costly when coreset becomes larger and often produces suboptimal results. In this work, for the first time we propose a continuous probabilistic bilevel formulation of coreset selection by learning a probablistic weight for each training sample. The overall objective is posed as a bilevel optimization problem, where 1) the inner loop samples coresets and train the model to convergence and 2) the outer loop updates the sample probability progressively according to the model's performance. Importantly, we develop an efficient solver to the bilevel optimization problem via unbiased policy gradient without trouble of implicit differentiation. We provide the convergence property of our training procedure and demonstrate the superiority of our algorithm against various coreset selection methods in various tasks, especially in more challenging label-noise and class-imbalance scenarios.  ( 2 min )
    A two stages Deep Learning Architecture for Model Reduction of Parametric Time-Dependent Problems. (arXiv:2301.09926v1 [math.NA])
    Parametric time-dependent systems are of a crucial importance in modeling real phenomena, often characterized by non-linear behaviors too. Those solutions are typically difficult to generalize in a sufficiently wide parameter space while counting on limited computational resources available. As such, we present a general two-stages deep learning framework able to perform that generalization with low computational effort in time. It consists in a separated training of two pipe-lined predictive models. At first, a certain number of independent neural networks are trained with data-sets taken from different subsets of the parameter space. Successively, a second predictive model is specialized to properly combine the first-stage guesses and compute the right predictions. Promising results are obtained applying the framework to incompressible Navier-Stokes equations in a cavity (Rayleigh-Bernard cavity), obtaining a 97% reduction in the computational time comparing with its numerical resolution for a new value of the Grashof number.  ( 2 min )
    Neighborhood Homophily-Guided Graph Convolutional Network. (arXiv:2301.09851v1 [cs.LG])
    Graph neural networks (GNNs) have achieved remarkable advances in graph-oriented tasks. However, many real-world graphs contain heterophily or low homophily, challenging the homophily assumption of classical GNNs and resulting in low performance. Although many studies have emerged to improve the universality of GNNs, they rarely consider the label reuse and the correlation of their proposed metrics and models. In this paper, we first design a new metric, named Neighborhood Homophily (\textit{NH}), to measure the label complexity or purity in the neighborhood of nodes. Furthermore, we incorporate this metric into the classical graph convolutional network (GCN) architecture and propose \textbf{N}eighborhood \textbf{H}omophily-\textbf{G}uided \textbf{G}raph \textbf{C}onvolutional \textbf{N}etwork (\textbf{NHGCN}). In this framework, nodes are grouped by estimated \textit{NH} values to achieve intra-group weight sharing during message propagation and aggregation. Then the generated node predictions are used to estimate and update new \textit{NH} values. The two processes of metric estimation and model inference are alternately optimized to achieve better node classification. Extensive experiments on both homophilous and heterophilous benchmarks demonstrate that \textbf{NHGCN} achieves state-of-the-art overall performance on semi-supervised node classification for the universality problem.  ( 2 min )
    Learning To Dive In Branch And Bound. (arXiv:2301.09943v1 [cs.LG])
    Primal heuristics are important for solving mixed integer linear programs, because they find feasible solutions that facilitate branch and bound search. A prominent group of primal heuristics are diving heuristics. They iteratively modify and resolve linear programs to conduct a depth-first search from any node in the search tree. Existing divers rely on generic decision rules that fail to exploit structural commonality between similar problem instances that often arise in practice. Therefore, we propose L2Dive to learn specific diving heuristics with graph neural networks: We train generative models to predict variable assignments and leverage the duality of linear programs to make diving decisions based on the model's predictions. L2Dive is fully integrated into the open-source solver SCIP. We find that L2Dive outperforms standard divers to find better feasible solutions on a range of combinatorial optimization problems. For real-world applications from server load balancing and neural network verification, L2Dive improves the primal-dual integral by up to 7% (35%) on average over a tuned (default) solver baseline and reduces average solving time by 20% (29%).  ( 2 min )
    Same or Different? Diff-Vectors for Authorship Analysis. (arXiv:2301.09862v1 [cs.LG])
    We investigate the effects on authorship identification tasks of a fundamental shift in how to conceive the vectorial representations of documents that are given as input to a supervised learner. In ``classic'' authorship analysis a feature vector represents a document, the value of a feature represents (an increasing function of) the relative frequency of the feature in the document, and the class label represents the author of the document. We instead investigate the situation in which a feature vector represents an unordered pair of documents, the value of a feature represents the absolute difference in the relative frequencies (or increasing functions thereof) of the feature in the two documents, and the class label indicates whether the two documents are from the same author or not. This latter (learner-independent) type of representation has been occasionally used before, but has never been studied systematically. We argue that it is advantageous, and that in some cases (e.g., authorship verification) it provides a much larger quantity of information to the training process than the standard representation. The experiments that we carry out on several publicly available datasets (among which one that we here make available for the first time) show that feature vectors representing pairs of documents (that we here call Diff-Vectors) bring about systematic improvements in the effectiveness of authorship identification tasks, and especially so when training data are scarce (as it is often the case in real-life authorship identification scenarios). Our experiments tackle same-author verification, authorship verification, and closed-set authorship attribution; while DVs are naturally geared for solving the 1st, we also provide two novel methods for solving the 2nd and 3rd that use a solver for the 1st as a building block.  ( 2 min )
    Solving the Discretised Neutron Diffusion Equations using Neural Networks. (arXiv:2301.09939v1 [cs.CE])
    This paper presents a new approach which uses the tools within Artificial Intelligence (AI) software libraries as an alternative way of solving partial differential equations (PDEs) that have been discretised using standard numerical methods. In particular, we describe how to represent numerical discretisations arising from the finite volume and finite element methods by pre-determining the weights of convolutional layers within a neural network. As the weights are defined by the discretisation scheme, no training of the network is required and the solutions obtained are identical (accounting for solver tolerances) to those obtained with standard codes often written in Fortran or C++. We also explain how to implement the Jacobi method and a multigrid solver using the functions available in AI libraries. For the latter, we use a U-Net architecture which is able to represent a sawtooth multigrid method. A benefit of using AI libraries in this way is that one can exploit their power and their built-in technologies. For example, their executions are already optimised for different computer architectures, whether it be CPUs, GPUs or new-generation AI processors. In this article, we apply the proposed approach to eigenvalue problems in reactor physics where neutron transport is described by diffusion theory. For a fuel assembly benchmark, we demonstrate that the solution obtained from our new approach is the same (accounting for solver tolerances) as that obtained from the same discretisation coded in a standard way using Fortran. We then proceed to solve a reactor core benchmark using the new approach.  ( 2 min )
    A Stability Analysis of Fine-Tuning a Pre-Trained Model. (arXiv:2301.09820v1 [cs.LG])
    Fine-tuning a pre-trained model (such as BERT, ALBERT, RoBERTa, T5, GPT, etc.) has proven to be one of the most promising paradigms in recent NLP research. However, numerous recent works indicate that fine-tuning suffers from the instability problem, i.e., tuning the same model under the same setting results in significantly different performance. Many recent works have proposed different methods to solve this problem, but there is no theoretical understanding of why and how these methods work. In this paper, we propose a novel theoretical stability analysis of fine-tuning that focuses on two commonly used settings, namely, full fine-tuning and head tuning. We define the stability under each setting and prove the corresponding stability bounds. The theoretical bounds explain why and how several existing methods can stabilize the fine-tuning procedure. In addition to being able to explain most of the observed empirical discoveries, our proposed theoretical analysis framework can also help in the design of effective and provable methods. Based on our theory, we propose three novel strategies to stabilize the fine-tuning procedure, namely, Maximal Margin Regularizer (MMR), Multi-Head Loss (MHLoss), and Self Unsupervised Re-Training (SURT). We extensively evaluate our proposed approaches on 11 widely used real-world benchmark datasets, as well as hundreds of synthetic classification datasets. The experiment results show that our proposed methods significantly stabilize the fine-tuning procedure and also corroborate our theoretical analysis.  ( 2 min )
    Optimizing the Noise in Self-Supervised Learning: from Importance Sampling to Noise-Contrastive Estimation. (arXiv:2301.09696v1 [stat.ML])
    Self-supervised learning is an increasingly popular approach to unsupervised learning, achieving state-of-the-art results. A prevalent approach consists in contrasting data points and noise points within a classification task: this requires a good noise distribution which is notoriously hard to specify. While a comprehensive theory is missing, it is widely assumed that the optimal noise distribution should in practice be made equal to the data distribution, as in Generative Adversarial Networks (GANs). We here empirically and theoretically challenge this assumption. We turn to Noise-Contrastive Estimation (NCE) which grounds this self-supervised task as an estimation problem of an energy-based model of the data. This ties the optimality of the noise distribution to the sample efficiency of the estimator, which is rigorously defined as its asymptotic variance, or mean-squared error. In the special case where the normalization constant only is unknown, we show that NCE recovers a family of Importance Sampling estimators for which the optimal noise is indeed equal to the data distribution. However, in the general case where the energy is also unknown, we prove that the optimal noise density is the data density multiplied by a correction term based on the Fisher score. In particular, the optimal noise distribution is different from the data distribution, and is even from a different family. Nevertheless, we soberly conclude that the optimal noise may be hard to sample from, and the gain in efficiency can be modest compared to choosing the noise distribution equal to the data's.  ( 2 min )
    Noisy Parallel Data Alignment. (arXiv:2301.09685v1 [cs.CL])
    An ongoing challenge in current natural language processing is how its major advancements tend to disproportionately favor resource-rich languages, leaving a significant number of under-resourced languages behind. Due to the lack of resources required to train and evaluate models, most modern language technologies are either nonexistent or unreliable to process endangered, local, and non-standardized languages. Optical character recognition (OCR) is often used to convert endangered language documents into machine-readable data. However, such OCR output is typically noisy, and most word alignment models are not built to work under such noisy conditions. In this work, we study the existing word-level alignment models under noisy settings and aim to make them more robust to noisy data. Our noise simulation and structural biasing method, tested on multiple language pairs, manages to reduce the alignment error rate on a state-of-the-art neural-based alignment model up to 59.6%.  ( 2 min )
    Data Augmentation Alone Can Improve Adversarial Training. (arXiv:2301.09879v1 [cs.CV])
    Adversarial training suffers from the issue of robust overfitting, which seriously impairs its generalization performance. Data augmentation, which is effective at preventing overfitting in standard training, has been observed by many previous works to be ineffective in mitigating overfitting in adversarial training. This work proves that, contrary to previous findings, data augmentation alone can significantly boost accuracy and robustness in adversarial training. We find that the hardness and the diversity of data augmentation are important factors in combating robust overfitting. In general, diversity can improve both accuracy and robustness, while hardness can boost robustness at the cost of accuracy within a certain limit and degrade them both over that limit. To mitigate robust overfitting, we first propose a new crop transformation, Cropshift, which has improved diversity compared to the conventional one (Padcrop). We then propose a new data augmentation scheme, based on Cropshift, with much improved diversity and well-balanced hardness. Empirically, our augmentation method achieves the state-of-the-art accuracy and robustness for data augmentations in adversarial training. Furthermore, when combined with weight averaging it matches, or even exceeds, the performance of the best contemporary regularization methods for alleviating robust overfitting. Code is available at: https://github.com/TreeLLi/DA-Alone-Improves-AT.  ( 2 min )
    Topological Structure is Predictive of Deep Neural Network Success in Learning. (arXiv:2301.09734v1 [cs.LG])
    Machine learning has become a fundamental tool in modern science, yet its limitations are still not fully understood. Using a simple children's game, we show that the topological structure of the underlying training data can have a dramatic effect on the ability of a deep neural network (DNN) classifier to learn to classify data. We then take insights obtained from this toy model and apply them to two physical data sets (one from particle physics and one from acoustics), which are known to be amenable to classification by DNN's. We show that the simplicity in their topological structure explains the majority of the DNN's ability to operate on these data sets by showing that fully interpretable topological classifiers are able to perform nearly as well as their DNN counterparts.  ( 2 min )
    Slice-and-Forge: Making Better Use of Caches for Graph Convolutional Network Accelerators. (arXiv:2301.09813v1 [cs.LG])
    Graph convolutional networks (GCNs) are becoming increasingly popular as they can process a wide variety of data formats that prior deep neural networks cannot easily support. One key challenge in designing hardware accelerators for GCNs is the vast size and randomness in their data access patterns which greatly reduces the effectiveness of the limited on-chip cache. Aimed at improving the effectiveness of the cache by mitigating the irregular data accesses, prior studies often employ the vertex tiling techniques used in traditional graph processing applications. While being effective at enhancing the cache efficiency, those approaches are often sensitive to the tiling configurations where the optimal setting heavily depends on target input datasets. Furthermore, the existing solutions require manual tuning through trial-and-error or rely on sub-optimal analytical models. In this paper, we propose Slice-and-Forge (SnF), an efficient hardware accelerator for GCNs which greatly improves the effectiveness of the limited on-chip cache. SnF chooses a tiling strategy named feature slicing that splits the features into vertical slices and processes them in the outermost loop of the execution. This particular choice results in a repetition of the identical computational patterns over irregular graph data over multiple rounds. Taking advantage of such repetitions, SnF dynamically tunes its tile size. Our experimental results reveal that SnF can achieve 1.73x higher performance in geomean compared to prior work on multi-engine settings, and 1.46x higher performance in geomean on small scale settings, without the need for off-line analyses.  ( 2 min )
    Gossiped and Quantized Online Multi-Kernel Learning. (arXiv:2301.09848v1 [cs.LG])
    In instances of online kernel learning where little prior information is available and centralized learning is unfeasible, past research has shown that distributed and online multi-kernel learning provides sub-linear regret as long as every pair of nodes in the network can communicate (i.e., the communications network is a complete graph). In addition, to manage the communication load, which is often a performance bottleneck, communications between nodes can be quantized. This letter expands on these results to non-fully connected graphs, which is often the case in wireless sensor networks. To address this challenge, we propose a gossip algorithm and provide a proof that it achieves sub-linear regret. Experiments with real datasets confirm our findings.  ( 2 min )
    Heterogeneous Domain Adaptation for IoT Intrusion Detection: A Geometric Graph Alignment Approach. (arXiv:2301.09801v1 [cs.CR])
    Data scarcity hinders the usability of data-dependent algorithms when tackling IoT intrusion detection (IID). To address this, we utilise the data rich network intrusion detection (NID) domain to facilitate more accurate intrusion detection for IID domains. In this paper, a Geometric Graph Alignment (GGA) approach is leveraged to mask the geometric heterogeneities between domains for better intrusion knowledge transfer. Specifically, each intrusion domain is formulated as a graph where vertices and edges represent intrusion categories and category-wise interrelationships, respectively. The overall shape is preserved via a confused discriminator incapable to identify adjacency matrices between different intrusion domain graphs. A rotation avoidance mechanism and a centre point matching mechanism is used to avoid graph misalignment due to rotation and symmetry, respectively. Besides, category-wise semantic knowledge is transferred to act as vertex-level alignment. To exploit the target data, a pseudo-label election mechanism that jointly considers network prediction, geometric property and neighbourhood information is used to produce fine-grained pseudo-label assignment. Upon aligning the intrusion graphs geometrically from different granularities, the transferred intrusion knowledge can boost IID performance. Comprehensive experiments on several intrusion datasets demonstrate state-of-the-art performance of the GGA approach and validate the usefulness of GGA constituting components.  ( 2 min )
    Backdoor Attacks in Peer-to-Peer Federated Learning. (arXiv:2301.09732v1 [cs.LG])
    We study backdoor attacks in peer-to-peer federated learning systems on different graph topologies and datasets. We show that only 5% attacker nodes are sufficient to perform a backdoor attack with 42% attack success without decreasing the accuracy on clean data by more than 2%. We also demonstrate that the attack can be amplified by the attacker crashing a small number of nodes. We evaluate defenses proposed in the context of centralized federated learning and show they are ineffective in peer-to-peer settings. Finally, we propose a defense that mitigates the attacks by applying different clipping norms to the model updates received from peers and local model trained by a node.  ( 2 min )
    Truveta Mapper: A Zero-shot Ontology Alignment Framework. (arXiv:2301.09767v1 [cs.LG])
    In this paper, a new perspective is suggested for unsupervised Ontology Matching (OM) or Ontology Alignment (OA) by treating it as a translation task. Ontologies are represented as graphs, and the translation is performed from a node in the source ontology graph to a path in the target ontology graph. The proposed framework, Truveta Mapper (TM), leverages a multi-task sequence-to-sequence transformer model to perform alignment across multiple ontologies in a zero-shot, unified and end-to-end manner. Multi-tasking enables the model to implicitly learn the relationship between different ontologies via transfer-learning without requiring any explicit cross-ontology manually labeled data. This also enables the formulated framework to outperform existing solutions for both runtime latency and alignment quality. The model is pre-trained and fine-tuned only on publicly available text corpus and inner-ontologies data. The proposed solution outperforms state-of-the-art approaches, Edit-Similarity, LogMap, AML, BERTMap, and the recently presented new OM frameworks in Ontology Alignment Evaluation Initiative (OAEI22), offers log-linear complexity in contrast to quadratic in the existing end-to-end methods, and overall makes the OM task efficient and more straightforward without much post-processing involving mapping extension or mapping repair.  ( 2 min )
    Constrained Reinforcement Learning for Dexterous Manipulation. (arXiv:2301.09766v1 [cs.RO])
    Existing learning approaches to dexterous manipulation use demonstrations or interactions with the environment to train black-box neural networks that provide little control over how the robot learns the skills or how it would perform post training. These approaches pose significant challenges when implemented on physical platforms given that, during initial stages of training, the robot's behavior could be erratic and potentially harmful to its own hardware, the environment, or any humans in the vicinity. A potential way to address these limitations is to add constraints during learning that restrict and guide the robot's behavior during training as well as roll outs. Inspired by the success of constrained approaches in other domains, we investigate the effects of adding position-based constraints to a 24-DOF robot hand learning to perform object relocation using Constrained Policy Optimization. We find that a simple geometric constraint can ensure the robot learns to move towards the object sooner than without constraints. Further, training with this constraint requires a similar number of samples as its unconstrained counterpart to master the skill. These findings shed light on how simple constraints can help robots achieve sensible and safe behavior quickly and ease concerns surrounding hardware deployment. We also investigate the effects of the strictness of these constraints and report findings that provide insights into how different degrees of strictness affect learning outcomes. Our code is available at https://github.com/GT-STAR-Lab/constrained-rl-dexterous-manipulation.  ( 2 min )
    DODEM: DOuble DEfense Mechanism Against Adversarial Attacks Towards Secure Industrial Internet of Things Analytics. (arXiv:2301.09740v1 [cs.CR])
    Industrial Internet of Things (I-IoT) is a collaboration of devices, sensors, and networking equipment to monitor and collect data from industrial operations. Machine learning (ML) methods use this data to make high-level decisions with minimal human intervention. Data-driven predictive maintenance (PDM) is a crucial ML-based I-IoT application to find an optimal maintenance schedule for industrial assets. The performance of these ML methods can seriously be threatened by adversarial attacks where an adversary crafts perturbed data and sends it to the ML model to deteriorate its prediction performance. The models should be able to stay robust against these attacks where robustness is measured by how much perturbation in input data affects model performance. Hence, there is a need for effective defense mechanisms that can protect these models against adversarial attacks. In this work, we propose a double defense mechanism to detect and mitigate adversarial attacks in I-IoT environments. We first detect if there is an adversarial attack on a given sample using novelty detection algorithms. Then, based on the outcome of our algorithm, marking an instance as attack or normal, we select adversarial retraining or standard training to provide a secondary defense layer. If there is an attack, adversarial retraining provides a more robust model, while we apply standard training for regular samples. Since we may not know if an attack will take place, our adaptive mechanism allows us to consider irregular changes in data. The results show that our double defense strategy is highly efficient where we can improve model robustness by up to 64.6% and 52% compared to standard and adversarial retraining, respectively.  ( 2 min )
    Long-term stable Electromyography classification using Canonical Correlation Analysis. (arXiv:2301.09729v1 [cs.LG])
    Discrimination of hand gestures based on the decoding of surface electromyography (sEMG) signals is a well-establish approach for controlling prosthetic devices and for Human-Machine Interfaces (HMI). However, despite the promising results achieved by this approach in well-controlled experimental conditions, its deployment in long-term real-world application scenarios is still hindered by several challenges. One of the most critical challenges is maintaining high EMG data classification performance across multiple days without retraining the decoding system. The drop in performance is mostly due to the high EMG variability caused by electrodes shift, muscle artifacts, fatigue, user adaptation, or skin-electrode interfacing issues. Here we propose a novel statistical method based on canonical correlation analysis (CCA) that stabilizes EMG classification performance across multiple days for long-term control of prosthetic devices. We show how CCA can dramatically decrease the performance drop of standard classifiers observed across days, by maximizing the correlation among multiple-day acquisition data sets. Our results show how the performance of a classifier trained on EMG data acquired only of the first day of the experiment maintains 90% relative accuracy across multiple days, compensating for the EMG data variability that occurs over long-term periods, using the CCA transformation on data obtained from a small number of gestures. This approach eliminates the need for large data sets and multiple or periodic training sessions, which currently hamper the usability of conventional pattern recognition based approaches  ( 2 min )
    Earthquake Magnitude and b value prediction model using Extreme Learning Machine. (arXiv:2301.09756v1 [physics.geo-ph])
    Earthquake prediction has been a challenging research area for many decades, where the future occurrence of this highly uncertain calamity is predicted. In this paper, several parametric and non-parametric features were calculated, where the non-parametric features were calculated using the parametric features. $8$ seismic features were calculated using Gutenberg-Richter law, the total recurrence, and the seismic energy release. Additionally, criterions such as Maximum Relevance and Maximum Redundancy were applied to choose the pertinent features. These features along with others were used as input for an Extreme Learning Machine (ELM) Regression Model. Magnitude and time data of $5$ decades from the Assam-Guwahati region were used to create this model for magnitude prediction. The Testing Accuracy and Testing Speed were computed taking the Root Mean Squared Error (RMSE) as the parameter for evaluating the mode. As confirmed by the results, ELM shows better scalability with much faster training and testing speed (up to a thousand times faster) than traditional Support Vector Machines. The testing RMSE came out to be around $0.097$. To further test the model's robustness -- magnitude-time data from California was used to calculate the seismic indicators which were then fed into an ELM and then tested on the Assam-Guwahati region. The model proves to be robust and can be implemented in early warning systems as it continues to be a major part of Disaster Response and management.  ( 2 min )
    Two-Stage Learning For the Flexible Job Shop Scheduling Problem. (arXiv:2301.09703v1 [cs.AI])
    The Flexible Job-shop Scheduling Problem (FJSP) is an important combinatorial optimization problem that arises in manufacturing and service settings. FJSP is composed of two subproblems, an assignment problem that assigns tasks to machines, and a scheduling problem that determines the starting times of tasks on their chosen machines. Solving FJSP instances of realistic size and composition is an ongoing challenge even under simplified, deterministic assumptions. Motivated by the inevitable randomness and uncertainties in supply chains, manufacturing, and service operations, this paper investigates the potential of using a deep learning framework to generate fast and accurate approximations for FJSP. In particular, this paper proposes a two-stage learning framework 2SLFJSP that explicitly models the hierarchical nature of FJSP decisions, uses a confidence-aware branching scheme to generate appropriate instances for the scheduling stage from the assignment predictions and leverages a novel symmetry-breaking formulation to improve learnability. 2SL-FJSP is evaluated on instances from the FJSP benchmark library. Results show that 2SL-FJSP can generate high-quality solutions in milliseconds, outperforming a state-of-the-art reinforcement learning approach recently proposed in the literature, and other heuristics commonly used in practice.  ( 2 min )
    Implementation of the Critical Wave Groups Method with Computational Fluid Dynamics and Neural Networks. (arXiv:2301.09834v1 [physics.flu-dyn])
    Accurate and efficient prediction of extreme ship responses continues to be a challenging problem in ship hydrodynamics. Probabilistic frameworks in conjunction with computationally efficient numerical hydrodynamic tools have been developed that allow researchers and designers to better understand extremes. However, the ability of these hydrodynamic tools to represent the physics quantitatively during extreme events is limited. Previous research successfully implemented the critical wave groups (CWG) probabilistic method with computational fluid dynamics (CFD). Although the CWG method allows for less simulation time than a Monte Carlo approach, the large quantity of simulations required is cost prohibitive. The objective of the present paper is to reduce the computational cost of implementing CWG with CFD, through the construction of long short-term memory (LSTM) neural networks. After training the models with a limited quantity of simulations, the models can provide a larger quantity of predictions to calculate the probability. The new framework is demonstrated with a 2-D midship section of the Office of Naval Research Tumblehome (ONRT) hull in Sea State 7 and beam seas at zero speed. The new framework is able to produce predictions that are representative of a purely CFD-driven CWG framework, with two orders of magnitude of computational cost savings.  ( 2 min )
    Long-tail Detection with Effective Class-Margins. (arXiv:2301.09724v1 [cs.CV])
    Large-scale object detection and instance segmentation face a severe data imbalance. The finer-grained object classes become, the less frequent they appear in our datasets. However, at test-time, we expect a detector that performs well for all classes and not just the most frequent ones. In this paper, we provide a theoretical understanding of the long-trail detection problem. We show how the commonly used mean average precision evaluation metric on an unknown test set is bound by a margin-based binary classification error on a long-tailed object detection training set. We optimize margin-based binary classification error with a novel surrogate objective called \textbf{Effective Class-Margin Loss} (ECM). The ECM loss is simple, theoretically well-motivated, and outperforms other heuristic counterparts on LVIS v1 benchmark over a wide range of architecture and detectors. Code is available at \url{https://github.com/janghyuncho/ECM-Loss}.  ( 2 min )
    Graph Neural Networks for Decentralized Multi-Agent Perimeter Defense. (arXiv:2301.09689v1 [cs.MA])
    In this work, we study the problem of decentralized multi-agent perimeter defense that asks for computing actions for defenders with local perceptions and communications to maximize the capture of intruders. One major challenge for practical implementations is to make perimeter defense strategies scalable for large-scale problem instances. To this end, we leverage graph neural networks (GNNs) to develop an imitation learning framework that learns a mapping from defenders' local perceptions and their communication graph to their actions. The proposed GNN-based learning network is trained by imitating a centralized expert algorithm such that the learned actions are close to that generated by the expert algorithm. We demonstrate that our proposed network performs closer to the expert algorithm and is superior to other baseline algorithms by capturing more intruders. Our GNN-based network is trained at a small scale and can be generalized to large-scale cases. We run perimeter defense games in scenarios with different team sizes and configurations to demonstrate the performance of the learned network.  ( 2 min )
    Illumination Variation Correction Using Image Synthesis For Unsupervised Domain Adaptive Person Re-Identification. (arXiv:2301.09702v1 [eess.IV])
    Unsupervised domain adaptive (UDA) person re-identification (re-ID) aims to learn identity information from labeled images in source domains and apply it to unlabeled images in a target domain. One major issue with many unsupervised re-identification methods is that they do not perform well relative to large domain variations such as illumination, viewpoint, and occlusions. In this paper, we propose a Synthesis Model Bank (SMB) to deal with illumination variation in unsupervised person re-ID. The proposed SMB consists of several convolutional neural networks (CNN) for feature extraction and Mahalanobis matrices for distance metrics. They are trained using synthetic data with different illumination conditions such that their synergistic effect makes the SMB robust against illumination variation. To better quantify the illumination intensity and improve the quality of synthetic images, we introduce a new 3D virtual-human dataset for GAN-based image synthesis. From our experiments, the proposed SMB outperforms other synthesis methods on several re-ID benchmarks.  ( 2 min )
    PRIMEQA: The Prime Repository for State-of-the-Art MultilingualQuestion Answering Research and Development. (arXiv:2301.09715v1 [cs.CL])
    The field of Question Answering (QA) has made remarkable progress in recent years, thanks to the advent of large pre-trained language models, newer realistic benchmark datasets with leaderboards, and novel algorithms for key components such as retrievers and readers. In this paper, we introduce PRIMEQA: a one-stop and open-source QA repository with an aim to democratize QA re-search and facilitate easy replication of state-of-the-art (SOTA) QA methods. PRIMEQA supports core QA functionalities like retrieval and reading comprehension as well as auxiliary capabilities such as question generation.It has been designed as an end-to-end toolkit for various use cases: building front-end applications, replicating SOTA methods on pub-lic benchmarks, and expanding pre-existing methods. PRIMEQA is available at : https://github.com/primeqa.  ( 2 min )
    On The Convergence Of Policy Iteration-Based Reinforcement Learning With Monte Carlo Policy Evaluation. (arXiv:2301.09709v1 [cs.LG])
    A common technique in reinforcement learning is to evaluate the value function from Monte Carlo simulations of a given policy, and use the estimated value function to obtain a new policy which is greedy with respect to the estimated value function. A well-known longstanding open problem in this context is to prove the convergence of such a scheme when the value function of a policy is estimated from data collected from a single sample path obtained from implementing the policy (see page 99 of [Sutton and Barto, 2018], page 8 of [Tsitsiklis, 2002]). We present a solution to the open problem by showing that a first-visit version of such a policy iteration scheme indeed converges to the optimal policy provided that the policy improvement step uses lookahead [Silver et al., 2016, Mnih et al., 2016, Silver et al., 2017b] rather than a simple greedy policy improvement. We provide results both for the original open problem in the tabular setting and also present extensions to the function approximation setting, where we show that the policy resulting from the algorithm performs close to the optimal policy within a function approximation error.  ( 2 min )
    Weakly-Supervised Questions for Zero-Shot Relation Extraction. (arXiv:2301.09640v1 [cs.CL])
    Zero-Shot Relation Extraction (ZRE) is the task of Relation Extraction where the training and test sets have no shared relation types. This very challenging domain is a good test of a model's ability to generalize. Previous approaches to ZRE reframed relation extraction as Question Answering (QA), allowing for the use of pre-trained QA models. However, this method required manually creating gold question templates for each new relation. Here, we do away with these gold templates and instead learn a model that can generate questions for unseen relations. Our technique can successfully translate relation descriptions into relevant questions, which are then leveraged to generate the correct tail entity. On tail entity extraction, we outperform the previous state-of-the-art by more than 16 F1 points without using gold question templates. On the RE-QA dataset where no previous baseline for relation extraction exists, our proposed algorithm comes within 0.7 F1 points of a system that uses gold question templates. Our model also outperforms the state-of-the-art ZRE baselines on the FewRel and WikiZSL datasets, showing that QA models no longer need template questions to match the performance of models specifically tailored to the ZRE task. Our implementation is available at https://github.com/fyshelab/QA-ZRE.  ( 2 min )
    Selective Explanations: Leveraging Human Input to Align Explainable AI. (arXiv:2301.09656v1 [cs.AI])
    While a vast collection of explainable AI (XAI) algorithms have been developed in recent years, they are often criticized for significant gaps with how humans produce and consume explanations. As a result, current XAI techniques are often found to be hard to use and lack effectiveness. In this work, we attempt to close these gaps by making AI explanations selective -- a fundamental property of human explanations -- by selectively presenting a subset from a large set of model reasons based on what aligns with the recipient's preferences. We propose a general framework for generating selective explanations by leveraging human input on a small sample. This framework opens up a rich design space that accounts for different selectivity goals, types of input, and more. As a showcase, we use a decision-support task to explore selective explanations based on what the decision-maker would consider relevant to the decision task. We conducted two experimental studies to examine three out of a broader possible set of paradigms based on our proposed framework: in Study 1, we ask the participants to provide their own input to generate selective explanations, with either open-ended or critique-based input. In Study 2, we show participants selective explanations based on input from a panel of similar users (annotators). Our experiments demonstrate the promise of selective explanations in reducing over-reliance on AI and improving decision outcomes and subjective perceptions of the AI, but also paint a nuanced picture that attributes some of these positive effects to the opportunity to provide one's own input to augment AI explanations. Overall, our work proposes a novel XAI framework inspired by human communication behaviors and demonstrates its potentials to encourage future work to better align AI explanations with human production and consumption of explanations.  ( 2 min )
    Flexible conditional density estimation for time series. (arXiv:2301.09671v1 [stat.ME])
    This paper introduces FlexCodeTS, a new conditional density estimator for time series. FlexCodeTS is a flexible nonparametric conditional density estimator, which can be based on an arbitrary regression method. It is shown that FlexCodeTS inherits the rate of convergence of the chosen regression method. Hence, FlexCodeTS can adapt its convergence by employing the regression method that best fits the structure of data. From an empirical perspective, FlexCodeTS is compared to NNKCDE and GARCH in both simulated and real data. FlexCodeTS is shown to generally obtain the best performance among the selected methods according to either the CDE loss or the pinball loss.  ( 2 min )
    DiffSDS: A language diffusion model for protein backbone inpainting under geometric conditions and constraints. (arXiv:2301.09642v1 [q-bio.QM])
    Have you ever been troubled by the complexity and computational cost of SE(3) protein structure modeling and been amazed by the simplicity and power of language modeling? Recent work has shown promise in simplifying protein structures as sequences of protein angles; therefore, language models could be used for unconstrained protein backbone generation. Unfortunately, such simplification is unsuitable for the constrained protein inpainting problem, where the model needs to recover masked structures conditioned on unmasked ones, as it dramatically increases the computing cost of geometric constraints. To overcome this dilemma, we suggest inserting a hidden \textbf{a}tomic \textbf{d}irection \textbf{s}pace (\textbf{ADS}) upon the language model, converting invariant backbone angles into equivalent direction vectors and preserving the simplicity, called Seq2Direct encoder ($\text{Enc}_{s2d}$). Geometric constraints could be efficiently imposed on the newly introduced direction space. A Direct2Seq decoder ($\text{Dec}_{d2s}$) with mathematical guarantees is also introduced to develop a \textbf{SDS} ($\text{Enc}_{s2d}$+$\text{Dec}_{d2s}$) model. We apply the SDS model as the denoising neural network during the conditional diffusion process, resulting in a constrained generative model--\textbf{DiffSDS}. Extensive experiments show that the plug-and-play ADS could transform the language model into a strong structural model without loss of simplicity. More importantly, the proposed DiffSDS outperforms previous strong baselines by a large margin on the task of protein inpainting.  ( 2 min )
  • Open

    Incorporating functional summary information in Bayesian neural networks using a Dirichlet process likelihood approach. (arXiv:2207.01234v2 [cs.LG] UPDATED)
    Bayesian neural networks (BNNs) can account for both aleatoric and epistemic uncertainty. However, in BNNs the priors are often specified over the weights which rarely reflects true prior knowledge in large and complex neural network architectures. We present a simple approach to incorporate prior knowledge in BNNs based on external summary information about the predicted classification probabilities for a given dataset. The available summary information is incorporated as augmented data and modeled with a Dirichlet process, and we derive the corresponding \emph{Summary Evidence Lower BOund}. The approach is founded on Bayesian principles, and all hyperparameters have a proper probabilistic interpretation. We show how the method can inform the model about task difficulty and class imbalance. Extensive experiments show that, with negligible computational overhead, our method parallels and in many cases outperforms popular alternatives in accuracy, uncertainty calibration, and robustness against corruptions with both balanced and imbalanced data.  ( 2 min )
    Multiway Spherical Clustering via Degree-Corrected Tensor Block Models. (arXiv:2201.07401v2 [math.ST] UPDATED)
    We consider the problem of multiway clustering in the presence of unknown degree heterogeneity. Such data problems arise commonly in applications such as recommendation system, neuroimaging, community detection, and hypergraph partitions in social networks. The allowance of degree heterogeneity provides great flexibility in clustering models, but the extra complexity poses significant challenges in both statistics and computation. Here, we develop a degree-corrected tensor block model with estimation accuracy guarantees. We present the phase transition of clustering performance based on the notion of angle separability, and we characterize three signal-to-noise regimes corresponding to different statistical-computational behaviors. In particular, we demonstrate that an intrinsic statistical-to-computational gap emerges only for tensors of order three or greater. Further, we develop an efficient polynomial-time algorithm that provably achieves exact clustering under mild signal conditions. The efficacy of our procedure is demonstrated through two data applications, one on human brain connectome project, and another on Peru Legislation network dataset.  ( 2 min )
    Proportional Fairness in Federated Learning. (arXiv:2202.01666v3 [cs.LG] UPDATED)
    With the increasingly broad deployment of federated learning (FL) systems in the real world, it is critical but challenging to ensure fairness in FL, i.e. reasonably satisfactory performances for each of the numerous diverse clients. In this work, we introduce and study a new fairness notion in FL, called proportional fairness (PF), which is based on the relative change of each client's performance. From its connection with the bargaining games, we propose PropFair, a novel and easy-to-implement algorithm for finding proportionally fair solutions in FL and study its convergence properties. Through extensive experiments on vision and language datasets, we demonstrate that PropFair can approximately find PF solutions, and it achieves a good balance between the average performances of all clients and of the worst 10% clients.  ( 2 min )
    Model-Agnostic Confidence Intervals for Feature Importance: A Fast and Powerful Approach Using Minipatch Ensembles. (arXiv:2206.02088v2 [stat.ML] UPDATED)
    To promote new scientific discoveries from complex data sets, feature importance inference has been a long-standing statistical problem. Instead of testing for parameters that are only interpretable for specific models, there has been increasing interest in model-agnostic methods, often in the form of feature occlusion or leave-one-covariate-out (LOCO) inference. Existing approaches often make distributional assumptions, which can be difficult to verify in practice, or require model refitting and data splitting, which are computationally intensive and lead to losses in power. In this work, we develop a novel, mostly model-agnostic and distribution-free inference framework for feature importance that is computationally efficient and statistically powerful. Our approach is fast as we avoid model refitting by leveraging a form of random observation and feature subsampling called minipatch ensembles; this approach also improves statistical power by avoiding data splitting. Our framework can be applied on tabular data and with any machine learning algorithm, together with minipatch ensembles, for regression and classification tasks. Despite the dependencies induced by using minipatch ensembles, we show that our approach provides asymptotic coverage for the feature importance score of any model under mild assumptions. Finally, our same procedure can also be leveraged to provide valid confidence intervals for predictions, hence providing fast, simultaneous quantification of the uncertainty of both predictions and feature importance. We validate our intervals on a series of synthetic and real data examples, including non-linear settings, showing that our approach detects the correct important features and exhibits many computational and statistical advantages over existing methods.  ( 2 min )
    A Wholistic View of Continual Learning with Deep Neural Networks: Forgotten Lessons and the Bridge to Active and Open World Learning. (arXiv:2009.01797v3 [cs.LG] UPDATED)
    Current deep learning methods are regarded as favorable if they empirically perform well on dedicated test sets. This mentality is seamlessly reflected in the resurfacing area of continual learning, where consecutively arriving data is investigated. The core challenge is framed as protecting previously acquired representations from being catastrophically forgotten. However, comparison of individual methods is nevertheless performed in isolation from the real world by monitoring accumulated benchmark test set performance. The closed world assumption remains predominant, i.e. models are evaluated on data that is guaranteed to originate from the same distribution as used for training. This poses a massive challenge as neural networks are well known to provide overconfident false predictions on unknown and corrupted instances. In this work we critically survey the literature and argue that notable lessons from open set recognition, identifying unknown examples outside of the observed set, and the adjacent field of active learning, querying data to maximize the expected performance gain, are frequently overlooked in the deep learning era. Hence, we propose a consolidated view to bridge continual learning, active learning and open set recognition in deep neural networks. Finally, the established synergies are supported empirically, showing joint improvement in alleviating catastrophic forgetting, querying data, selecting task orders, while exhibiting robust open world application.  ( 2 min )
    Neyman-Pearson Multi-class Classification via Cost-sensitive Learning. (arXiv:2111.04597v2 [stat.ML] UPDATED)
    Most existing classification methods aim to minimize the overall misclassification error rate. However, in applications, different types of errors can have different consequences. Two popular paradigms have been developed to account for this asymmetry issue: the Neyman-Pearson (NP) paradigm and the cost-sensitive (CS) paradigm. Compared to the CS paradigm, the NP paradigm does not require a specification of costs. Most previous works on the NP paradigm focused on the binary case. In this work, we study the multi-class NP problem by connecting it to the CS problem and propose two algorithms. We extend the NP oracle inequalities and consistency from the binary case to the multi-class case, showing that our two algorithms enjoy these properties under certain conditions. The simulation and real data studies demonstrate the effectiveness of our algorithms. To our knowledge, this is the first work to solve the multi-class NP problem via cost-sensitive learning techniques with theoretical guarantees. The proposed algorithms are implemented in the R package npcs on CRAN.  ( 2 min )
    Concentration Inequalities for Two-Sample Rank Processes with Application to Bipartite Ranking. (arXiv:2104.02943v3 [math.ST] UPDATED)
    The ROC curve is the gold standard for measuring the performance of a test/scoring statistic regarding its capacity to discriminate between two statistical populations in a wide variety of applications, ranging from anomaly detection in signal processing to information retrieval, through medical diagnosis. Most practical performance measures used in scoring/ranking applications such as the AUC, the local AUC, the p-norm push, the DCG and others, can be viewed as summaries of the ROC curve. In this paper, the fact that most of these empirical criteria can be expressed as two-sample linear rank statistics is highlighted and concentration inequalities for collections of such random variables, referred to as two-sample rank processes here, are proved, when indexed by VC classes of scoring functions. Based on these nonasymptotic bounds, the generalization capacity of empirical maximizers of a wide class of ranking performance criteria is next investigated from a theoretical perspective. It is also supported by empirical evidence through convincing numerical experiments.  ( 2 min )
    Upper and Lower Bounds on the Performance of Kernel PCA. (arXiv:2012.10369v2 [cs.LG] UPDATED)
    Principal Component Analysis (PCA) is a popular method for dimension reduction and has attracted an unfailing interest for decades. More recently, kernel PCA (KPCA) has emerged as an extension of PCA but, despite its use in practice, a sound theoretical understanding of KPCA is missing. We contribute several lower and upper bounds on the efficiency of KPCA, involving the empirical eigenvalues of the kernel Gram matrix and new quantities involving a notion of variance. These bounds show how much information is captured by KPCA on average and contribute a better theoretical understanding of its efficiency. We demonstrate that fast convergence rates are achievable for a widely used class of kernels and we highlight the importance of some desirable properties of datasets to ensure KPCA efficiency.  ( 2 min )
    Federated Learning Meets Multi-objective Optimization. (arXiv:2006.11489v2 [cs.LG] UPDATED)
    Federated learning has emerged as a promising, massively distributed way to train a joint deep model over large amounts of edge devices while keeping private user data strictly on device. In this work, motivated from ensuring fairness among users and robustness against malicious adversaries, we formulate federated learning as multi-objective optimization and propose a new algorithm FedMGDA+ that is guaranteed to converge to Pareto stationary solutions. FedMGDA+ is simple to implement, has fewer hyperparameters to tune, and refrains from sacrificing the performance of any participating user. We establish the convergence properties of FedMGDA+ and point out its connections to existing approaches. Extensive experiments on a variety of datasets confirm that FedMGDA+ compares favorably against state-of-the-art.  ( 2 min )
    Improving Open-Set Semi-Supervised Learning with Self-Supervision. (arXiv:2301.10127v1 [cs.LG])
    Open-set semi-supervised learning (OSSL) is a realistic setting of semi-supervised learning where the unlabeled training set contains classes that are not present in the labeled set. Many existing OSSL methods assume that these out-of-distribution data are harmful and put effort into excluding data from unknown classes from the training objective. In contrast, we propose an OSSL framework that facilitates learning from all unlabeled data through self-supervision. Additionally, we utilize an energy-based score to accurately recognize data belonging to the known classes, making our method well-suited for handling uncurated data in deployment. We show through extensive experimental evaluations on several datasets that our method shows overall unmatched robustness and performance in terms of closed-set accuracy and open-set recognition compared with state-of-the-art for OSSL. Our code will be released upon publication.  ( 2 min )
    Double Matching Under Complementary Preferences. (arXiv:2301.10230v1 [stat.ML])
    In this paper, we propose a new algorithm for addressing the problem of matching markets with complementary preferences, where agents' preferences are unknown a priori and must be learned from data. The presence of complementary preferences can lead to instability in the matching process, making this problem challenging to solve. To overcome this challenge, we formulate the problem as a bandit learning framework and propose the Multi-agent Multi-type Thompson Sampling (MMTS) algorithm. The algorithm combines the strengths of Thompson Sampling for exploration with a double matching technique to achieve a stable matching outcome. Our theoretical analysis demonstrates the effectiveness of MMTS as it is able to achieve stability at every matching step, satisfies the incentive-compatibility property, and has a sublinear Bayesian regret over time. Our approach provides a useful method for addressing complementary preferences in real-world scenarios.  ( 2 min )
    How Jellyfish Characterise Alternating Group Equivariant Neural Networks. (arXiv:2301.10152v1 [cs.LG])
    We provide a full characterisation of all of the possible alternating group ($A_n$) equivariant neural networks whose layers are some tensor power of $\mathbb{R}^{n}$. In particular, we find a basis of matrices for the learnable, linear, $A_n$-equivariant layer functions between such tensor power spaces in the standard basis of $\mathbb{R}^{n}$. We also describe how our approach generalises to the construction of neural networks that are equivariant to local symmetries.  ( 2 min )
    A Robust Hypothesis Test for Tree Ensemble Pruning. (arXiv:2301.10115v1 [cs.LG])
    Gradient boosted decision trees are some of the most popular algorithms in applied machine learning. They are a flexible and powerful tool that can robustly fit to any tabular dataset in a scalable and computationally efficient way. One of the most critical parameters to tune when fitting these models are the various penalty terms used to distinguish signal from noise in the current model. These penalties are effective in practice, but are lacking in robust theoretical justifications. In this paper we develop and present a novel theoretically justified hypothesis test of split quality for gradient boosted tree ensembles and demonstrate that using this method instead of the common penalty terms leads to a significant reduction in out of sample loss. Additionally, this method provides a theoretically well-justified stopping condition for the tree growing algorithm. We also present several innovative extensions to the method, opening the door for a wide variety of novel tree pruning algorithms.  ( 2 min )
    Inducing Point Allocation for Sparse Gaussian Processes in High-Throughput Bayesian Optimisation. (arXiv:2301.10123v1 [cs.LG])
    Sparse Gaussian Processes are a key component of high-throughput Bayesian Optimisation (BO) loops; however, we show that existing methods for allocating their inducing points severely hamper optimisation performance. By exploiting the quality-diversity decomposition of Determinantal Point Processes, we propose the first inducing point allocation strategy designed specifically for use in BO. Unlike existing methods which seek only to reduce global uncertainty in the objective function, our approach provides the local high-fidelity modelling of promising regions required for precise optimisation. More generally, we demonstrate that our proposed framework provides a flexible way to allocate modelling capacity in sparse models and so is suitable broad range of downstream sequential decision making tasks.  ( 2 min )
    Forecasting the 2016-2017 Central Apennines Earthquake Sequence with a Neural Point Process. (arXiv:2301.09948v1 [physics.geo-ph])
    Point processes have been dominant in modeling the evolution of seismicity for decades, with the Epidemic Type Aftershock Sequence (ETAS) model being most popular. Recent advances in machine learning have constructed highly flexible point process models using neural networks to improve upon existing parametric models. We investigate whether these flexible point process models can be applied to short-term seismicity forecasting by extending an existing temporal neural model to the magnitude domain and we show how this model can forecast earthquakes above a target magnitude threshold. We first demonstrate that the neural model can fit synthetic ETAS data, however, requiring less computational time because it is not dependent on the full history of the sequence. By artificially emulating short-term aftershock incompleteness in the synthetic dataset, we find that the neural model outperforms ETAS. Using a new enhanced catalog from the 2016-2017 Central Apennines earthquake sequence, we investigate the predictive skill of ETAS and the neural model with respect to the lowest input magnitude. Constructing multiple forecasting experiments using the Visso, Norcia and Campotosto earthquakes to partition training and testing data, we target M3+ events. We find both models perform similarly at previously explored thresholds (e.g., above M3), but lowering the threshold to M1.2 reduces the performance of ETAS unlike the neural model. We argue that some of these gains are due to the neural model's ability to handle incomplete data. The robustness to missing data and speed to train the neural model present it as an encouraging competitor in earthquake forecasting.  ( 2 min )
    Adaptive Probabilistic Forecasting of Electricity (Net-)Load. (arXiv:2301.10090v1 [stat.AP])
    We focus on electricity load forecasting under three important specificities. First, our setting is adaptive; we use models taking into account the most recent observations available, yielding a forecasting strategy able to automatically respond to regime changes. Second, we consider probabilistic rather than point forecasting; indeed, uncertainty quantification is required to operate electricity systems efficiently and reliably. Third, we consider both conventional load (consumption only) and netload (consumption less embedded generation). Our methodology relies on the Kalman filter, previously used successfully for adaptive point load forecasting. The probabilistic forecasts are obtained by quantile regressions on the residuals of the point forecasting model. We achieve adaptive quantile regressions using the online gradient descent; we avoid the choice of the gradient step size considering multiple learning rates and aggregation of experts. We apply the method to two data sets: the regional net-load in Great Britain and the demand of seven large cities in the United States. Adaptive procedures improve forecast performance substantially in both use cases and for both point and probabilistic forecasting.  ( 2 min )
    Heterogeneous Domain Adaptation for IoT Intrusion Detection: A Geometric Graph Alignment Approach. (arXiv:2301.09801v1 [cs.CR])
    Data scarcity hinders the usability of data-dependent algorithms when tackling IoT intrusion detection (IID). To address this, we utilise the data rich network intrusion detection (NID) domain to facilitate more accurate intrusion detection for IID domains. In this paper, a Geometric Graph Alignment (GGA) approach is leveraged to mask the geometric heterogeneities between domains for better intrusion knowledge transfer. Specifically, each intrusion domain is formulated as a graph where vertices and edges represent intrusion categories and category-wise interrelationships, respectively. The overall shape is preserved via a confused discriminator incapable to identify adjacency matrices between different intrusion domain graphs. A rotation avoidance mechanism and a centre point matching mechanism is used to avoid graph misalignment due to rotation and symmetry, respectively. Besides, category-wise semantic knowledge is transferred to act as vertex-level alignment. To exploit the target data, a pseudo-label election mechanism that jointly considers network prediction, geometric property and neighbourhood information is used to produce fine-grained pseudo-label assignment. Upon aligning the intrusion graphs geometrically from different granularities, the transferred intrusion knowledge can boost IID performance. Comprehensive experiments on several intrusion datasets demonstrate state-of-the-art performance of the GGA approach and validate the usefulness of GGA constituting components.  ( 2 min )
    Context-specific kernel-based hidden Markov model for time series analysis. (arXiv:2301.09870v1 [stat.ML])
    Traditional hidden Markov models have been a useful tool to understand and model stochastic dynamic linear data; in the case of non-Gaussian data or not linear in mean data, models such as mixture of Gaussian hidden Markov models suffer from the computation of precision matrices and have a lot of unnecessary parameters. As a consequence, such models often perform better when it is assumed that all variables are independent, a hypothesis that may be unrealistic. Hidden Markov models based on kernel density estimation is also capable of modeling non Gaussian data, but they assume independence between variables. In this article, we introduce a new hidden Markov model based on kernel density estimation, which is capable of introducing kernel dependencies using context-specific Bayesian networks. The proposed model is described, together with a learning algorithm based on the expectation-maximization algorithm. Additionally, the model is compared with related HMMs using synthetic and real data. From the results, the benefits in likelihood and classification accuracy from the proposed model are quantified and analyzed.  ( 2 min )
    On Dynamic Regret and Constraint Violations in Constrained Online Convex Optimization. (arXiv:2301.09808v1 [cs.LG])
    A constrained version of the online convex optimization (OCO) problem is considered. With slotted time, for each slot, first an action is chosen. Subsequently the loss function and the constraint violation penalty evaluated at the chosen action point is revealed. For each slot, both the loss function as well as the function defining the constraint set is assumed to be smooth and strongly convex. In addition, once an action is chosen, local information about a feasible set within a small neighborhood of the current action is also revealed. An algorithm is allowed to compute at most one gradient at its point of choice given the described feedback to choose the next action. The goal of an algorithm is to simultaneously minimize the dynamic regret (loss incurred compared to the oracle's loss) and the constraint violation penalty (penalty accrued compared to the oracle's penalty). We propose an algorithm that follows projected gradient descent over a suitably chosen set around the current action. We show that both the dynamic regret and the constraint violation is order-wise bounded by the {\it path-length}, the sum of the distances between the consecutive optimal actions. Moreover, we show that the derived bounds are the best possible.  ( 2 min )
    Optimizing the Noise in Self-Supervised Learning: from Importance Sampling to Noise-Contrastive Estimation. (arXiv:2301.09696v1 [stat.ML])
    Self-supervised learning is an increasingly popular approach to unsupervised learning, achieving state-of-the-art results. A prevalent approach consists in contrasting data points and noise points within a classification task: this requires a good noise distribution which is notoriously hard to specify. While a comprehensive theory is missing, it is widely assumed that the optimal noise distribution should in practice be made equal to the data distribution, as in Generative Adversarial Networks (GANs). We here empirically and theoretically challenge this assumption. We turn to Noise-Contrastive Estimation (NCE) which grounds this self-supervised task as an estimation problem of an energy-based model of the data. This ties the optimality of the noise distribution to the sample efficiency of the estimator, which is rigorously defined as its asymptotic variance, or mean-squared error. In the special case where the normalization constant only is unknown, we show that NCE recovers a family of Importance Sampling estimators for which the optimal noise is indeed equal to the data distribution. However, in the general case where the energy is also unknown, we prove that the optimal noise density is the data density multiplied by a correction term based on the Fisher score. In particular, the optimal noise distribution is different from the data distribution, and is even from a different family. Nevertheless, we soberly conclude that the optimal noise may be hard to sample from, and the gain in efficiency can be modest compared to choosing the noise distribution equal to the data's.  ( 2 min )
    Flexible conditional density estimation for time series. (arXiv:2301.09671v1 [stat.ME])
    This paper introduces FlexCodeTS, a new conditional density estimator for time series. FlexCodeTS is a flexible nonparametric conditional density estimator, which can be based on an arbitrary regression method. It is shown that FlexCodeTS inherits the rate of convergence of the chosen regression method. Hence, FlexCodeTS can adapt its convergence by employing the regression method that best fits the structure of data. From an empirical perspective, FlexCodeTS is compared to NNKCDE and GARCH in both simulated and real data. FlexCodeTS is shown to generally obtain the best performance among the selected methods according to either the CDE loss or the pinball loss.  ( 2 min )
    Improved Rate of First Order Algorithms for Entropic Optimal Transport. (arXiv:2301.09675v1 [math.OC])
    This paper improves the state-of-the-art rate of a first-order algorithm for solving entropy regularized optimal transport. The resulting rate for approximating the optimal transport (OT) has been improved from $\widetilde{{O}}({n^{2.5}}/{\epsilon})$ to $\widetilde{{O}}({n^2}/{\epsilon})$, where $n$ is the problem size and $\epsilon$ is the accuracy level. In particular, we propose an accelerated primal-dual stochastic mirror descent algorithm with variance reduction. Such special design helps us improve the rate compared to other accelerated primal-dual algorithms. We further propose a batch version of our stochastic algorithm, which improves the computational performance through parallel computing. To compare, we prove that the computational complexity of the Stochastic Sinkhorn algorithm is $\widetilde{{O}}({n^2}/{\epsilon^2})$, which is slower than our accelerated primal-dual stochastic mirror algorithm. Experiments are done using synthetic and real data, and the results match our theoretical rates. Our algorithm may inspire more research to develop accelerated primal-dual algorithms that have rate $\widetilde{{O}}({n^2}/{\epsilon})$ for solving OT.  ( 2 min )

  • Open

    Just found a new chrome extension called IntelliMail that uses AI to write emails. Its super easy to use and can be used to land internships, jobs and up your email game.
    submitted by /u/bobsandalex [link] [comments]  ( 40 min )
    I created an 'AI Tools' series on YouTube and I'd love your feedback! Today is About Jasper AI
    I would love to know your opinion about what I should improve on mt videos and of course if you don't know about Jasper AI give a look, it's a great AI tool for Content Creation https://youtu.be/x_6rzsBVABg submitted by /u/sigmabruuh [link] [comments]  ( 6 min )
    OpenAi's breakthrough
    https://twitter.com/make_mhe/status/1618255363580755968 submitted by /u/bradasm [link] [comments]  ( 40 min )
    I asked an AI image generator to show me a "typical Discord user"
    submitted by /u/MSAPW [link] [comments]  ( 40 min )
    AI Dream 150 - SUPERNOVA IMMINENT Part1 TEASER - AI Video vqgan clip
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    I am recommending you something
    It's an amazing app called Simplywrite. It gives you 20 credits for free. It lets you generate articles and outlines. Totally a free tool to use when you bored. LINK TO APP: https://simplywrite.com submitted by /u/benignkirby [link] [comments]  ( 40 min )
    Anonymize tabular data to meet GDPR privacy requirements
    submitted by /u/Repeat-or [link] [comments]  ( 40 min )
    I made a list of tools powered by AI
    submitted by /u/Alen0tv [link] [comments]  ( 40 min )
    How to automatically generate test cases in NLP?
    submitted by /u/Nazma2015 [link] [comments]  ( 40 min )
    Best TTS for cheesy movie trailer guy voice?
    Looking for a TTS that can generate that really deep movie trailer male voice. Any ideas? submitted by /u/infinitycurvature [link] [comments]  ( 40 min )
    Best video style transfer program as of now?
    Looking something where you can capture a face speaking and replace it with a completely different character (including neck, face, shoulders). Looking to replace a human with an animated character. What options are there as of now? I know ebsynth, but it's not too good. There is vtuber software, but they look too "anime". Looking for something that would give a very "photorealistic" result, think Avatar quality output or a Pixar movie. What is the best that's out there right now? submitted by /u/UpperStruggle2421 [link] [comments]  ( 41 min )
    Beware Loab, the digital cryptid lurking in AI's forgotten space
    submitted by /u/Phishstixxx [link] [comments]  ( 40 min )
    Will coders and writers just be doing QA of AI output?
    So 5 years ago, the main career and hobby advice I had was "look to see if programming is a thing for you, it's at least a fun hobby and maybe you can find a niche as one of the last of the custom craftspeople" But now... where do you go, in a society where wealth tends to get sucked upward, to find a good living? Do you learn to roll with being a bot manager? (Assuming the AIs don't just manage themselves?) Do you look for trades of manipulating physical stuff that haven't yet lent themselves to cybernetics? submitted by /u/kirk_is_ [link] [comments]  ( 44 min )
    merge photo with ai
    hi everyone now i have two different photos (not image) and how can i merge thats with using ai. submitted by /u/Aigerim_D40 [link] [comments]  ( 40 min )
    What is your preferred Image Generation API / App?
    It is really difficult to benchmark Text-to-Image AIs, it relies on so many aspects: speed, styles, precision of the prompt, interface, fine-tuning, etc. So I think the best approach is to see which are the most prefered by the people who use Stable Diffusion API. Do not hesitate to explain your choice in comments, and also mention APIs that are not in the Poll, I am limited to 6 options... I know that I did not put Midjourney, Artbreeder, Stable Diffusion, NightCafe, Crayion, Starry AI and many other but I am interested in those which provide API only. PS: this isn't promotional at all, I am not working for any of those companies. View Poll submitted by /u/JerLam2762 [link] [comments]  ( 42 min )
    I asked an AI image generator to show me a "typical Reddit user"
    submitted by /u/MrsChenHW [link] [comments]  ( 41 min )
    Yann LeCun, Meta’s Chief AI Scientist, Has Some Harsh Criticism Of ChatGPT.
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    A New Free AI Model InstructPix2Pix To Transform Images By Plain English Instructions
    submitted by /u/CeFurkan [link] [comments]  ( 40 min )
    Fearing ChatGPT, Google enlists founders Brin and Page in AI fight
    submitted by /u/SAT0725 [link] [comments]  ( 40 min )
    Being really humorous under the pressure of billions of prompt requests
    submitted by /u/Imagine-your-success [link] [comments]  ( 41 min )
    Humanity's Quest to Decode Animal Languages Through AI
    submitted by /u/lambolifeofficial [link] [comments]  ( 40 min )
    The Connection Between Science Fiction and Artificial Intelligence: A Survey Study
    Hello everyone, I am a student in AP Research. For my project, I am conducting a survey to analyze the connection between science fiction and technology (specifically Artificial Intelligence). This survey (linked) asks a few questions about your knowledge of Sci-fi, Artificial Intelligence, and the connection between the two. It should not take more than 10 minutes of your time. If you are interested, the link to the form is below: https://docs.google.com/forms/d/e/1FAIpQLScY_VaNI-CEtTiJiLHgYCCguEZ7m9DUdQoxvFTjXFFLOGu2KA/viewform If you have additional questions, my email is in the linked google form. Thank you for your participation, it is deeply appreciated! submitted by /u/rsantos05 [link] [comments]  ( 41 min )
  • Open

    [D] Alphatensor benchmark code in Colab
    Hello everybody I was wondering if anybody tried to run the main factorisation code https://github.com/deepmind/alphatensor/blob/main/benchmarking/factorizations.py from alpha tensor on Google Colab, with Colab's GPUs ( Tesla T4). I know that Tesla T4 is not as the same as the V100 used in Deep Mind's paper, however, I can see that the tensor formulation for the matrix multiplication is highly inefficient, compared to standard JAX matrix multiplication. Any suggestion where am I wrong? submitted by /u/IndependentIce4553 [link] [comments]  ( 42 min )
    [N] Upcoming talk: "Open Problems in Deep Neural Networks: An Optimal Control Perspective"
    Open Problems in Deep Neural Networks: An Optimal Control Perspective Feb 13, 6:30 ET About the talk: Backpropagation is a widely used algorithm for training neural networks. Its key step, Stochastic Gradient Descent (SGD) has become one of the bedrocks of deep learning. Despite wide adoption, mathematically rigorous study of SGD's convergence for deep neural networks is still ongoing. Join us as graduate student Amoolya Tirumalai discusses an approach to the convergence problem inspired by optimal control theory. Following the Pontryargin maximum principle, an alternative forward-backward iterative system will be described. Toy examples will be shown, and problems in robustness and security will be discussed. About the speaker: Amoolya Tirumalai is a 4th year PhD Candidate in Electrical Engineering at the University of Maryland, College Park. His interests are (robust) optimal control, partial differential equations, differential games, mean-field games, safety-critical control, and (robust) machine learning. His thesis titled 'Multi-agent inference, decision-making and control: models, structure and performance evaluation' will be defended in August 2023. Mr. Tirumalai was conferred a BS in Biomedical Engineering from the Georgia Institute of Technology in 2018. submitted by /u/what_comes_next [link] [comments]  ( 43 min )
    [D] Publication Resume
    If we submit a publication to ICML and it is under anonymous review, can I list the title and authors on my resume which will be on my personal webpage? submitted by /u/BigDreamx [link] [comments]  ( 43 min )
    [R] Blogpost on comparing Chatbots like ChatGPT, LaMDA, Sparrow, BlenderBot 3, and Claude
    https://huggingface.co/blog/dialog-agents breaks down the techniques behind ChatGPT -- instruction fine-tuning, supervised fine-tuning, chain-of-thought, read teaming, and more. https://preview.redd.it/fv16fsemd9ea1.png?width=889&format=png&auto=webp&s=a8f24de27c40a946fec64eaa674f81ddef0d0cc3 submitted by /u/emailnazneen [link] [comments]  ( 42 min )
    [D] Efficient retrieval of research information for graduate research
    I have lot of notes about research papers in a particular directory and the number of files has started to become larger than what I can remember off the top of my head. It will continue to keep growing and I have begun to wonder the most efficient way to retrieve the information. I could use ripgrep and regular expressions to find the notes efficiently, but I imagine that if the database is very huge and I don't have the correct regular expression in use, then I might not retrieve the correct files. Inspired by chatGPT, I was impressed at how it presents info from the internet and speeds up my time for finding information even when I do not know the correct keywords. I figured a NLP model primarily trained on my database would be an easier task and I was wondering if someone had already created something like this as open source or how would they go about it? submitted by /u/waterstrider123 [link] [comments]  ( 46 min )
    [D] Accurate data or more data?
    If you are building a model and had the choice, would you prefer more accurate (~99%) but less data or a lot more data but less accurate (~90%)? submitted by /u/NoSympathy9787 [link] [comments]  ( 43 min )
    [R] Best service for scientific paper correction
    Hello, Anyone ever used a paper revision service and can recommend one ? I’m publishing my first paper next month and I want to have feedback from an expert on this domain. Thanks ! submitted by /u/Meddhouib10 [link] [comments]  ( 43 min )
    [D] Self-Supervised Contrastive Approaches that don’t use large batch size.
    This thread is dedicated to exploring the various techniques used in self-supervised contrastive learning that utilize standard batch sizes. I am seeking information on the current methods in this field, specifically those that do not rely on large batch sizes. I am familiar with the SimSiam paper published by META research, which utilizes 256 batch size for 8-GPUs. However, for individuals with limited resources such as myself, access to a large number of GPUs may not be feasible. As a result, I am interested in learning about other methods that can be used with smaller batch sizes and a single GPU, such as those that would be suitable for training on 1024x1024 input images. Additionally, I am curious about any more efficient architectures that have been developed in this field. This includes, but is not limited to, techniques used in natural language processing that may have applications in other areas of artificial intelligence. ***posted the same question in PyTorch forums, reposting here for wider reach. submitted by /u/shingekichan1996 [link] [comments]  ( 46 min )
    [R] Tsetlin Machine in Medical Research - Striking Differences Between Tsetlin Machine Interpretability and Deep Learning Attention
    ​ Tsetlin machine interpretability vs deep learning attention. Researchers at West China Hospital, Sichuan University, NORCE, and UiA have developed a Tsetlin machine-based architecture for premature ventricular contraction identification by analyzing long-term ECG signals. The experiments show that the Tsetlin machine is capable of producing human-interpretable rules, consistent with the clinical standard and medical knowledge. Simultaneously, the accuracy was comparable with deep CNN-based models. Paper: https://arxiv.org/abs/2301.10181 submitted by /u/olegranmo [link] [comments]  ( 43 min )
    [R] INSTRUCTOR One Embedder , Any Task: Instruction-Finetuned Text Embeddings Paper Explanation and Collab Demo
    In this video I explain about INSTRUCTOR, an instruction-finetuned text embedding model that can generate text embeddings tailored to any task (e.g., classification, retrieval, clustering, text evaluation, etc.) and domains (e.g., science, finance, etc.) by simply providing the task instruction, without any finetuning. Instructor achieves sota on 70 diverse embedding tasks! I also show a google collab demo of instructor https://youtu.be/vg38cq3KJ6M submitted by /u/Sea-Photo5230 [link] [comments]  ( 42 min )
    [R] Easiest way to train RNN's in MATLAB or Julia?
    I work as as a researcher and am kind of new to neural networks. I have an RNN (1e4 x 1e4 network) that I would like to train in either MATLAB or Julia. One option I considered is writing my own code for Hessian-free optimization, but the implementational details are really, really hard to figure out. I am aware there is a Theano or TF implementation of HFO but I I am primarily interested in having the code in MATLAB/Julia. Also, are there better/alternative techniques than Hessian-free optimization for training RNN's ? submitted by /u/NadaBrothers [link] [comments]  ( 44 min )
    Can an AI model licensed under the BigScience RAIL License v1.0 such as BLOOM be used in a program that is useful for any domain? [D]
    Example: the AI model BLOOM) is licensed under the BigScience RAIL License v1.0. The BigScience RAIL License v1.0 forbids that some types of usages: You agree not to use the Model or Derivatives of the Model: [...] To provide medical advice and medical results interpretation; To generate or disseminate information for the purpose to be used for administration of justice, law enforcement, immigration or asylum processes, such as predicting an individual will commit fraud/crime commitment (e.g. by text profiling, drawing causal relationships between assertions made in documents, indiscriminate and arbitrarily-targeted use). Am I allowed to use BLOOM in a program that is useful for any domain (e.g., a program to summarize or paraphrase some text, or perform question-answer on a text, or generate questions and their answers based on the text)? Since people could use the program for any domain, they could technically, for example, use the program to summarize a medical report or generate questions and their answers based on some asylum process to distribute to potential applicants. submitted by /u/Franck_Dernoncourt [link] [comments]  ( 44 min )
  • Open

    DM Control Suite vs. Original Environments
    I’m testing out DM control suite as I’d ideally like to do some stuff with the MuJoCo environments, e.g. Hopper. However, it seems as though they’ve changed Hopper from the OpenAI version? For instance, the action space is now 4-dimensional, and the bigger concern for me is that the reward seems to be specified differently. Per the gym documentation, the reward was healthy_reward + forward_reward - ctrl_cost, but when I’ve just started using the control suite version all rewards seem to be 0. The documentation for the control suite is quite poor, it says that for the hop task it is rewarded for torso heigh and forward velocity. It also doesn’t explain what the action dimensions correspond to (including the new dimension), so I can’t even manually test it! submitted by /u/DefinitelyNot4Burner [link] [comments]  ( 41 min )
    Does action masking reduce the ability of the agent to learn game rules?
    I recently experimented with training an sb3 PPO agent on a pretty complicated board game environment (just for fun). At first, I did regular PPO with an invalid action penalty, but it was making a lot of invalid moves and thus getting penalized and terminated early. It very slowly picked up on the signal and started to learn, but much too slowly to get any good results. After days of training, it could usually only play a handful of opening moves. On the other hand, I trained a Masked PPO in the same environment and it rapidly became quite good and was able to play relatively competitively after a few days of training. However, when I examined the outputs in an unmasked setting, it had little-to-no understanding of the game rules. It could still play OK but did not rank valid moves as the highest. This is a problem because I wanted to use it in a non-simulator setting without having to explicitly manually mask the moves by hand (or else convert a game state to a mask, both of which are tedious in my situation). Is this behavior expected? I have read some analyses that suggest that 1) MaskedPPO is much more sample efficient and should converge to a stronger agent MUCH faster, which makes sense, but also that 2) Even despite the invalid action masking, the agent should still learn game mechanics by proxy. If it's only being rewarded for making valid moves, it should learn to not make invalid moves implicitly since it never gets a reward signal for them (rather than being explicitly penalized). Thoughts? I only have a weak background in RL so apologies if this is naive. TLDR: Does action masking make the policy (or reward) network lazy? submitted by /u/TobusFire [link] [comments]  ( 44 min )
    Any List of Videogames With Reinforced Learning Agents Developed?
    Is there any list of videogames for which agents using reinforcement learning have been developed? Enquiring minds wanna know. submitted by /u/sanman [link] [comments]  ( 42 min )
    Learning to Exploit Elastic Actuators for Quadruped Locomotion
    submitted by /u/araffin2 [link] [comments]  ( 40 min )
    What is the limit on parallel environments?
    Is there some sort of hard or practical limit on the number of parallel environments that can be used? In Rllib when I try to use more 7 or 8 I get a scheduling error but yet I see people talking about 32 or 512 environments. What’s the limit? Is there some way for me to increase the amount I can train on? For example, my GPU seems under utilised but my CPU is very stressed, can I incrrrase GPU usage in Rllib? I have already set the number of GPUs to one. submitted by /u/centripetalstranger [link] [comments]  ( 42 min )
    Weird convergence of PPO reward when reducing number of envs
    Hi all, I am using Isaac Gym which enables the usage of multi environments. However, the reward value from the best environment has a huge difference, when training the agent with 512 environment (green) and 32 environment (orange), see below. I understand that the training should be slower when using less environments at the same time, but this difference tells me that I am missing something here... Does anyone have some hints? https://preview.redd.it/3rf0ax8653ea1.png?width=1589&format=png&auto=webp&s=b12defff668f186381b32c9f0385499b4413a3cd Below you can see the configs that I used for the PPO algorithm: config: name: ${resolve_default:CustomTask,${....experiment}} full_experiment_name: ${.name} env_name: rlgpu ppo: True mixed_precision: False normalize_input: True normalize_value: True value_bootstrap: True num_actors: ${....task.env.numEnvs} reward_shaper: scale_value: 1.0 normalize_advantage: True gamma: 0.99 tau: 0.95 learning_rate: 5e-4 lr_schedule: adaptive kl_threshold: 0.008 score_to_win: 10000000 max_epochs: ${resolve_default:5000,${....max_iterations}} save_best_after: 200 save_frequency: 100 print_stats: False use_action_masks: False grad_norm: 1.0 entropy_coef: 0.0001 truncate_grads: True e_clip: 0.2 horizon_length: 32 # num_envs * horizon length % minibatch_size minibatch_size: 1024 mini_epochs: 8 critic_coef: 4 clip_value: True seq_len: 4 bounds_loss_coef: 0.0001 submitted by /u/Fun-Moose-3841 [link] [comments]  ( 41 min )
  • Open

    Learning with Queried Hints
    Posted by Sreenivas Gollapudi, Senior Staff Research Scientist, and Kostas Kollias, Staff Research Scientist, Google Research, Algorithms & Optimization Team In many computing applications the system needs to make decisions to serve requests that arrive in an online fashion. Consider, for instance, the example of a navigation app that responds to driver requests. In such settings there is inherent uncertainty about important aspects of the problem. For example, the preferences of the driver with respect to features of the route are often unknown and the delays of road segments can be uncertain. The field of online machine learning studies such settings and provides various techniques for decision-making problems under uncertainty. A navigation engine has to decide how to route thi…  ( 92 min )
  • Open

    Newbie in need – very bad NLP text generator needed
    Hell y'all, after spending an hour typing various combinations of "AI", "TEXT GENERATOR", "DATA FEEDING" and such I come to you with a humble request; can someone recommend me an AI text generator that needs to be fed actual, existing text(s) instead of giving it a prompt? I need something that will create a text based on an essay I will upload, and I really don't need the result to be super great. I will accept and in fact warmly welcome any nonsensical output, as I need AI to spew out trash, actually. I just need that trash to resemble things that actually exist. Grammatical errors are great, idiotic sentences like "actually, when the orange fades into original edition of Alexander, then how do we expect the succession to leave Canada?" is what I need, the less the final product sounds like what a human could write, the better. I don't have the proper lingo to google what I need to find, so I am deeply grateful for any suggestions. submitted by /u/lindybopperette [link] [comments]  ( 42 min )
    Computer Vision Development
    Hey!! I'm new to this side of things. I studied psych research, always had an interest in data visualization and neuroscience but didn't realize I should be piecing the two interests together and I have been too intimidated to take on the task of learning computer science. But I can't help myself any longer! I'm so fascinated and think reddit could be a great place to learn and chat about concepts. Any who... YA! I've started watching https://www.youtube.com/watch?v=vT1JzLTH4G4&list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv&index=1 and already can't believe we haven't solved the process of vision. Have we? Can we? The meta is getting to me. submitted by /u/angelacarolei [link] [comments]  ( 41 min )
  • Open

    Research Focus: Week of January 23, 2023
    Welcome to Research Focus, a new series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft. Revolutionizing Document AI with multimodal document foundation models   Organizations must digitize various documents, many with charts and images, to manage and streamline essential functions. Yet manually […] The post Research Focus: Week of January 23, 2023 appeared first on Microsoft Research.  ( 8 min )
    Biomedical Research Platform Terra Now Available on Microsoft Azure
    We stand at the threshold of a new era of precision medicine, where health and life sciences data hold the potential to dramatically propel and expand our understanding and treatment of human disease. One of the tools that we believe will help to enable precision medicine is Terra, the secure biomedical research platform co-developed by […] The post Biomedical Research Platform Terra Now Available on Microsoft Azure appeared first on Microsoft Research.  ( 9 min )
  • Open

    Build a loyalty points anomaly detector using Amazon Lookout for Metrics
    Today, gaining customer loyalty cannot be a one-off thing. A brand needs a focused and integrated plan to retain its best customers—put simply, it needs a customer loyalty program. Earn and burn programs are one of the main paradigms. A typical earn and burn program rewards customers after a certain number of visits or spend. […]  ( 7 min )
    Explain text classification model predictions using Amazon SageMaker Clarify
    Model explainability refers to the process of relating the prediction of a machine learning (ML) model to the input feature values of an instance in humanly understandable terms. This field is often referred to as explainable artificial intelligence (XAI). Amazon SageMaker Clarify is a feature of Amazon SageMaker that enables data scientists and ML engineers […]  ( 10 min )
    Upscale images with Stable Diffusion in Amazon SageMaker JumpStart
    In November 2022, we announced that AWS customers can generate images from text with Stable Diffusion models in Amazon SageMaker JumpStart. Today, we announce a new feature that lets you upscale images (resize images without losing quality) with Stable Diffusion models in JumpStart. An image that is low resolution, blurry, and pixelated can be converted […]  ( 10 min )
    Cohere brings language AI to Amazon SageMaker
    It’s an exciting day for the development community. Cohere’s state-of-the-art language AI is now available through Amazon SageMaker. This makes it easier for developers to deploy Cohere’s pre-trained generation language model to Amazon SageMaker, an end-to-end machine learning (ML) service. Developers, data scientists, and business analysts use Amazon SageMaker to build, train, and deploy ML models quickly and easily using its fully managed infrastructure, tools, and workflows.  ( 6 min )
  • Open

    Braced From Space: Startup Keeps Watchful Eye on Gas Pipeline Leaks Across the Globe
    As its name suggests, Orbital Sidekick is creating technology that acts as a buddy in outer space, keeping an eye on the globe using satellites to help keep it safe and sustainable. The San Francisco-based startup, a member of the NVIDIA Inception program, enables commercial and government users to optimize sustainable operations and security with Read article >  ( 6 min )
    NVIDIA CEO Ignites AI Conversation in Stockholm
    Jensen Huang headlines Stockholm AI confab, Berzelius supercomputer upgraded to 94 NVIDIA DGX A100 systems.  ( 6 min )
  • Open

    Number of bits in a particular integer
    When I think of bit twiddling, I think of C. So I was surprised to read Paul Khuong saying he thinks of Common Lisp (“CL”). As always when working with bits, I first doodled in SLIME/SBCL: CL’s bit manipulation functions are more expressive than C’s, and a REPL helps exploration. I would not have thought […] Number of bits in a particular integer first appeared on John D. Cook.  ( 5 min )
  • Open

    Deep Learning-Based Assessment of Cerebral Microbleeds in COVID-19. (arXiv:2301.09322v1 [eess.IV])
    Cerebral Microbleeds (CMBs), typically captured as hypointensities from susceptibility-weighted imaging (SWI), are particularly important for the study of dementia, cerebrovascular disease, and normal aging. Recent studies on COVID-19 have shown an increase in CMBs of coronavirus cases. Automatic detection of CMBs is challenging due to the small size and amount of CMBs making the classes highly imbalanced, lack of publicly available annotated data, and similarity with CMB mimics such as calcifications, irons, and veins. Hence, the existing deep learning methods are mostly trained on very limited research data and fail to generalize to unseen data with high variability and cannot be used in clinical setups. To this end, we propose an efficient 3D deep learning framework that is actively trained on multi-domain data. Two public datasets assigned for normal aging, stroke, and Alzheimer's disease analysis as well as an in-house dataset for COVID-19 assessment are used to train and evaluate the models. The obtained results show that the proposed method is robust to low-resolution images and achieves 78% recall and 80% precision on the entire test set with an average false positive of 1.6 per scan.  ( 2 min )
    Large-scale fine-grained semantic indexing of biomedical literature based on weakly-supervised deep learning. (arXiv:2301.09350v1 [cs.CL])
    Semantic indexing of biomedical literature is usually done at the level of MeSH descriptors, representing topics of interest for the biomedical community. Several related but distinct biomedical concepts are often grouped together in a single coarse-grained descriptor and are treated as a single topic for semantic indexing. This study proposes a new method for the automated refinement of subject annotations at the level of concepts, investigating deep learning approaches. Lacking labelled data for this task, our method relies on weak supervision based on concept occurrence in the abstract of an article. The proposed approach is evaluated on an extended large-scale retrospective scenario, taking advantage of concepts that eventually become MeSH descriptors, for which annotations become available in MEDLINE/PubMed. The results suggest that concept occurrence is a strong heuristic for automated subject annotation refinement and can be further enhanced when combined with dictionary-based heuristics. In addition, such heuristics can be useful as weak supervision for developing deep learning models that can achieve further improvement in some cases.  ( 2 min )
    Counterfactual (Non-)identifiability of Learned Structural Causal Models. (arXiv:2301.09031v1 [stat.ML])
    Recent advances in probabilistic generative modeling have motivated learning Structural Causal Models (SCM) from observational datasets using deep conditional generative models, also known as Deep Structural Causal Models (DSCM). If successful, DSCMs can be utilized for causal estimation tasks, e.g., for answering counterfactual queries. In this work, we warn practitioners about non-identifiability of counterfactual inference from observational data, even in the absence of unobserved confounding and assuming known causal structure. We prove counterfactual identifiability of monotonic generation mechanisms with single dimensional exogenous variables. For general generation mechanisms with multi-dimensional exogenous variables, we provide an impossibility result for counterfactual identifiability, motivating the need for parametric assumptions. As a practical approach, we propose a method for estimating worst-case errors of learned DSCMs' counterfactual predictions. The size of this error can be an essential metric for deciding whether or not DSCMs are a viable approach for counterfactual inference in a specific problem setting. In evaluation, our method confirms negligible counterfactual errors for an identifiable SCM from prior work, and also provides informative error bounds on counterfactual errors for a non-identifiable synthetic SCM.  ( 2 min )
    Parallel Approaches to Accelerate Bayesian Decision Trees. (arXiv:2301.09090v1 [stat.CO])
    Markov Chain Monte Carlo (MCMC) is a well-established family of algorithms primarily used in Bayesian statistics to sample from a target distribution when direct sampling is challenging. Existing work on Bayesian decision trees uses MCMC. Unfortunately, this can be slow, especially when considering large volumes of data. It is hard to parallelise the accept-reject component of the MCMC. None-the-less, we propose two methods for exploiting parallelism in the MCMC: in the first, we replace the MCMC with another numerical Bayesian approach, the Sequential Monte Carlo (SMC) sampler, which has the appealing property that it is an inherently parallel algorithm; in the second, we consider data partitioning. Both methods use multi-core processing with a HighPerformance Computing (HPC) resource. We test the two methods in various study settings to determine which method is the most beneficial for each test case. Experiments show that data partitioning has limited utility in the settings we consider and that the use of the SMC sampler can improve run-time (compared to the sequential implementation) by up to a factor of 343.  ( 2 min )
    Learning Reservoir Dynamics with Temporal Self-Modulation. (arXiv:2301.09235v1 [cs.LG])
    Reservoir computing (RC) can efficiently process time-series data by transferring the input signal to randomly connected recurrent neural networks (RNNs), which are referred to as a reservoir. The high-dimensional representation of time-series data in the reservoir significantly simplifies subsequent learning tasks. Although this simple architecture allows fast learning and facile physical implementation, the learning performance is inferior to that of other state-of-the-art RNN models. In this paper, to improve the learning ability of RC, we propose self-modulated RC (SM-RC), which extends RC by adding a self-modulation mechanism. The self-modulation mechanism is realized with two gating variables: an input gate and a reservoir gate. The input gate modulates the input signal, and the reservoir gate modulates the dynamical properties of the reservoir. We demonstrated that SM-RC can perform attention tasks where input information is retained or discarded depending on the input signal. We also found that a chaotic state emerged as a result of learning in SM-RC. This indicates that self-modulation mechanisms provide RC with qualitatively different information-processing capabilities. Furthermore, SM-RC outperformed RC in NARMA and Lorentz model tasks. In particular, SM-RC achieved a higher prediction accuracy than RC with a reservoir 10 times larger in the Lorentz model tasks. Because the SM-RC architecture only requires two additional gates, it is physically implementable as RC, providing a new direction for realizing edge AI.  ( 2 min )
    Dataset Distillation: A Comprehensive Review. (arXiv:2301.07014v2 [cs.LG] UPDATED)
    Recent success of deep learning is largely attributed to the sheer amount of data used for training deep neural networks.Despite the unprecedented success, the massive data, unfortunately, significantly increases the burden on storage and transmission and further gives rise to a cumbersome model training process. Besides, relying on the raw data for training \emph{per se} yields concerns about privacy and copyright. To alleviate these shortcomings, dataset distillation~(DD), also known as dataset condensation (DC), was introduced and has recently attracted much research attention in the community. Given an original dataset, DD aims to derive a much smaller dataset containing synthetic samples, based on which the trained models yield performance comparable with those trained on the original dataset. In this paper, we give a comprehensive review and summary of recent advances in DD and its application. We first introduce the task formally and propose an overall algorithmic framework followed by all existing DD methods. Next, we provide a systematic taxonomy of current methodologies in this area, and discuss their theoretical interconnections. We also present current challenges in DD through extensive experiments and envision possible directions for future works.  ( 2 min )
    Talking About Large Language Models. (arXiv:2212.03551v3 [cs.CL] UPDATED)
    Thanks to rapid progress in artificial intelligence, we have entered an era when technology and philosophy intersect in interesting ways. Sitting squarely at the centre of this intersection are large language models (LLMs). The more adept LLMs become at mimicking human language, the more vulnerable we become to anthropomorphism, to seeing the systems in which they are embedded as more human-like than they really are. This trend is amplified by the natural tendency to use philosophically loaded terms, such as "knows", "believes", and "thinks", when describing these systems. To mitigate this trend, this paper advocates the practice of repeatedly stepping back to remind ourselves of how LLMs, and the systems of which they form a part, actually work. The hope is that increased scientific precision will encourage more philosophical nuance in the discourse around artificial intelligence, both within the field and in the public sphere.  ( 2 min )
    Generative Adversarial Networks to infer velocity components in rotating turbulent flows. (arXiv:2301.07541v1 [physics.flu-dyn] CROSS LISTED)
    Inference problems for two-dimensional snapshots of rotating turbulent flows are studied. We perform a systematic quantitative benchmark of point-wise and statistical reconstruction capabilities of the linear Extended Proper Orthogonal Decomposition (EPOD) method, a non-linear Convolutional Neural Network (CNN) and a Generative Adversarial Network (GAN). We attack the important task of inferring one velocity component out of the measurement of a second one, and two cases are studied: (I) both components lay in the plane orthogonal to the rotation axis and (II) one of the two is parallel to the rotation axis. We show that EPOD method works well only for the former case where both components are strongly correlated, while CNN and GAN always outperform EPOD both concerning point-wise and statistical reconstructions. For case (II), when the input and output data are weakly correlated, all methods fail to reconstruct faithfully the point-wise information. In this case, only GAN is able to reconstruct the field in a statistical sense. The analysis is performed using both standard validation tools based on L2 spatial distance between the prediction and the ground truth and more sophisticated multi-scale analysis using wavelet decomposition. Statistical validation is based on standard Jensen-Shannon divergence between the probability density functions, spectral properties and multi-scale flatness.  ( 2 min )
    Reconstructing Rayleigh-Benard flows out of temperature-only measurements using Physics-Informed Neural Networks. (arXiv:2301.07769v1 [physics.flu-dyn] CROSS LISTED)
    We investigate the capabilities of Physics-Informed Neural Networks (PINNs) to reconstruct turbulent Rayleigh-Benard flows using only temperature information. We perform a quantitative analysis of the quality of the reconstructions at various amounts of low-passed-filtered information and turbulent intensities. We compare our results with those obtained via nudging, a classical equation-informed data assimilation technique. At low Rayleigh numbers, PINNs are able to reconstruct with high precision, comparable to the one achieved with nudging. At high Rayleigh numbers, PINNs outperform nudging and are able to achieve satisfactory reconstruction of the velocity fields only when data for temperature is provided with high spatial and temporal density. When data becomes sparse, the PINNs performance worsens, not only in a point-to-point error sense but also, and contrary to nudging, in a statistical sense, as can be seen in the probability density functions and energy spectra.  ( 2 min )
    Clustering Categorical Data: Soft Rounding k-modes. (arXiv:2210.09640v2 [cs.LG] UPDATED)
    Over the last three decades, researchers have intensively explored various clustering tools for categorical data analysis. Despite the proposal of various clustering algorithms, the classical k-modes algorithm remains a popular choice for unsupervised learning of categorical data. Surprisingly, our first insight is that in a natural generative block model, the k-modes algorithm performs poorly for a large range of parameters. We remedy this issue by proposing a soft rounding variant of the k-modes algorithm (SoftModes) and theoretically prove that our variant addresses the drawbacks of the k-modes algorithm in the generative model. Finally, we empirically verify that SoftModes performs well on both synthetic and real-world datasets.  ( 2 min )
    Indirect Active Learning. (arXiv:2206.01454v3 [math.ST] UPDATED)
    Traditional models of active learning assume a learner can directly manipulate or query a covariate $X$ in order to study its relationship with a response $Y$. However, if $X$ is a feature of a complex system, it may be possible only to indirectly influence $X$ by manipulating a control variable $Z$, a scenario we refer to as Indirect Active Learning. Under a nonparametric model of Indirect Active Learning with a fixed budget, we study minimax convergence rates for estimating the relationship between $X$ and $Y$ locally at a point, obtaining different rates depending on the complexities and noise levels of the relationships between $Z$ and $X$ and between $X$ and $Y$. We also identify minimax rates for passive learning under comparable assumptions. In many cases, our results show that, while there is an asymptotic benefit to active learning, this benefit is fully realized by a simple two-stage learner that runs two passive experiments in sequence.  ( 2 min )
    A Multi-Phase Approach for Product Hierarchy Forecasting in Supply Chain Management: Application to MonarchFx Inc. (arXiv:2006.08931v2 [stat.ML] UPDATED)
    Hierarchical time series demands exist in many industries and are often associated with the product, time frame, or geographic aggregations. Traditionally, these hierarchies have been forecasted using top-down, bottom-up, or middle-out approaches. The question we aim to answer is how to utilize child-level forecasts to improve parent-level forecasts in a hierarchical supply chain. Improved forecasts can be used to considerably reduce logistics costs, especially in e-commerce. We propose a novel multi-phase hierarchical (MPH) approach. Our method involves forecasting each series in the hierarchy independently using machine learning models, then combining all forecasts to allow a second phase model estimation at the parent level. Sales data from MonarchFx Inc. (a logistics solutions provider) is used to evaluate our approach and compare it to bottom-up and top-down methods. Our results demonstrate an 82-90% improvement in forecast accuracy using the proposed approach. Using the proposed method, supply chain planners can derive more accurate forecasting models to exploit the benefit of multivariate data.
    Learning-Based Data Storage [Vision] (Technical Report). (arXiv:2206.05778v3 [cs.DB] UPDATED)
    Deep neural network (DNN) and its variants have been extensively used for a wide spectrum of real applications such as image classification, face/speech recognition, fraud detection, and so on. In addition to many important machine learning tasks, as artificial networks emulating the way brain cells function, DNNs also show the capability of storing non-linear relationships between input and output data, which exhibits the potential of storing data via DNNs. We envision a new paradigm of data storage, "DNN-as-a-Database", where data are encoded in well-trained machine learning models. Compared with conventional data storage that directly records data in raw formats, learning-based structures (e.g., DNN) can implicitly encode data pairs of inputs and outputs and compute/materialize actual output data of different resolutions only if input data are provided. This new paradigm can greatly enhance the data security by allowing flexible data privacy settings on different levels, achieve low space consumption and fast computation with the acceleration of new hardware (e.g., Diffractive Neural Network and AI chips), and can be generalized to distributed DNN-based storage/computing. In this paper, we propose this novel concept of learning-based data storage, which utilizes a learning structure called learning-based memory unit (LMU), to store, organize, and retrieve data. As a case study, we use DNNs as the engine in the LMU, and study the data capacity and accuracy of the DNN-based data storage. Our preliminary experimental results show the feasibility of the learning-based data storage by achieving high (100%) accuracy of the DNN storage. We explore and design effective solutions to utilize the DNN-based data storage to manage and query relational tables. We discuss how to generalize our solutions to other data types (e.g., graphs) and environments such as distributed DNN storage/computing.
    Estimating individual treatment effects under unobserved confounding using binary instruments. (arXiv:2208.08544v3 [stat.ME] UPDATED)
    Estimating conditional average treatment effects (CATEs) from observational data is relevant in many fields such as personalized medicine. However, in practice, the treatment assignment is usually confounded by unobserved variables and thus introduces bias. A remedy to remove the bias is the use of instrumental variables (IVs). Such settings are widespread in medicine (e.g., trials where the treatment assignment is used as binary IV). In this paper, we propose a novel, multiply robust machine learning framework, called MRIV, for estimating CATEs using binary IVs and thus yield an unbiased CATE estimator. Different from previous work for binary IVs, our framework estimates the CATE directly via a pseudo outcome regression. (1)~We provide a theoretical analysis where we show that our framework yields multiple robust convergence rates: our CATE estimator achieves fast convergence even if several nuisance estimators converge slowly. (2)~We further show that our framework asymptotically outperforms state-of-the-art plug-in IV methods for CATE estimation, in the sense that it achieves a faster rate of convergence if the CATE is smoother than the individual outcome surfaces. (3)~We build upon our theoretical results and propose a tailored deep neural network architecture called MRIV-Net for CATE estimation using binary IVs. Across various computational experiments, we demonstrate empirically that our MRIV-Net achieves state-of-the-art performance. To the best of our knowledge, our MRIV is the first multiply robust machine learning framework tailored to estimating CATEs in the binary IV setting.
    Computationally-efficient initialisation of GPs: The generalised variogram method. (arXiv:2210.05394v2 [cs.LG] UPDATED)
    We present a computationally-efficient strategy to find the hyperparameters of a Gaussian process (GP) avoiding the computation of the likelihood function. The found hyperparameters can then be used directly for regression or passed as initial conditions to maximum-likelihood (ML) training. Motivated by the fact that training a GP via ML is equivalent (on average) to minimising the KL-divergence between the true and learnt model, we set to explore different metrics/divergences among GPs that are computationally inexpensive and provide estimates close to those of ML. In particular, we identify the GP hyperparameters by projecting the empirical covariance or (Fourier) power spectrum onto a parametric family, thus proposing and studying various measures of discrepancy operating on the temporal or frequency domains. Our contribution extends the Variogram method developed by the geostatistics literature and, accordingly, it is referred to as the Generalised Variogram method (GVM). In addition to the theoretical presentation of GVM, we provide experimental validation in terms of accuracy, consistency with ML and computational complexity for different kernels using synthetic and real-world data.
    FedRolex: Model-Heterogeneous Federated Learning with Rolling Sub-Model Extraction. (arXiv:2212.01548v2 [cs.LG] UPDATED)
    Most cross-device federated learning (FL) studies focus on the model-homogeneous setting where the global server model and local client models are identical. However, such constraint not only excludes low-end clients who would otherwise make unique contributions to model training but also restrains clients from training large models due to on-device resource bottlenecks. In this work, we propose FedRolex, a partial training (PT)-based approach that enables model-heterogeneous FL and can train a global server model larger than the largest client model. At its core, FedRolex employs a rolling sub-model extraction scheme that allows different parts of the global server model to be evenly trained, which mitigates the client drift induced by the inconsistency between individual client models and server model architectures. We show that FedRolex outperforms state-of-the-art PT-based model-heterogeneous FL methods (e.g. Federated Dropout) and reduces the gap between model-heterogeneous and model-homogeneous FL, especially under the large-model large-dataset regime. In addition, we provide theoretical statistical analysis on its advantage over Federated Dropout and evaluate FedRolex on an emulated real-world device distribution to show that FedRolex can enhance the inclusiveness of FL and boost the performance of low-end devices that would otherwise not benefit from FL. Our code is available at: https://github.com/AIoT-MLSys-Lab/FedRolex
    Incorporating Task-specific Concept Knowledge into Script Learning. (arXiv:2209.00068v2 [cs.CL] UPDATED)
    In this paper, we present Tetris, a new task of Goal-Oriented Script Completion. Unlike previous work, it considers a more realistic and general setting, where the input includes not only the goal but also additional user context, including preferences and history. To address this problem, we propose a novel approach, which uses two techniques to improve performance: (1) concept prompting, and (2) script-oriented contrastive learning that addresses step repetition and hallucination problems. On our WikiHow-based dataset, we find that both methods improve performance. The dataset, repository, and models will be publicly available to facilitate further research on this new task.
    Concept-level Debugging of Part-Prototype Networks. (arXiv:2205.15769v2 [cs.LG] UPDATED)
    Part-prototype Networks (ProtoPNets) are concept-based classifiers designed to achieve the same performance as black-box models without compromising transparency. ProtoPNets compute predictions based on similarity to class-specific part-prototypes learned to recognize parts of training examples, making it easy to faithfully determine what examples are responsible for any target prediction and why. However, like other models, they are prone to picking up confounders and shortcuts from the data, thus suffering from compromised prediction accuracy and limited generalization. We propose ProtoPDebug, an effective concept-level debugger for ProtoPNets in which a human supervisor, guided by the model's explanations, supplies feedback in the form of what part-prototypes must be forgotten or kept, and the model is fine-tuned to align with this supervision. Our experimental evaluation shows that ProtoPDebug outperforms state-of-the-art debuggers for a fraction of the annotation cost. An online experiment with laypeople confirms the simplicity of the feedback requested to the users and the effectiveness of the collected feedback for learning confounder-free part-prototypes. ProtoPDebug is a promising tool for trustworthy interactive learning in critical applications, as suggested by a preliminary evaluation on a medical decision making task.
    A Comprehensive Survey on Enterprise Financial Risk Analysis: Problems, Methods, Spotlights and Applications. (arXiv:2211.14997v2 [q-fin.RM] UPDATED)
    Enterprise financial risk analysis aims at predicting the enterprises' future financial risk.Due to the wide application, enterprise financial risk analysis has always been a core research issue in finance. Although there are already some valuable and impressive surveys on risk management, these surveys introduce approaches in a relatively isolated way and lack the recent advances in enterprise financial risk analysis. Due to the rapid expansion of the enterprise financial risk analysis, especially from the computer science and big data perspective, it is both necessary and challenging to comprehensively review the relevant studies. This survey attempts to connect and systematize the existing enterprise financial risk researches, as well as to summarize and interpret the mechanisms and the strategies of enterprise financial risk analysis in a comprehensive way, which may help readers have a better understanding of the current research status and ideas. This paper provides a systematic literature review of over 300 articles published on enterprise risk analysis modelling over a 50-year period, 1968 to 2022. We first introduce the formal definition of enterprise risk as well as the related concepts. Then, we categorized the representative works in terms of risk type and summarized the three aspects of risk analysis. Finally, we compared the analysis methods used to model the enterprise financial risk. Our goal is to clarify current cutting-edge research and its possible future directions to model enterprise risk, aiming to fully understand the mechanisms of enterprise risk communication and influence and its application on corporate governance, financial institution and government regulation.
    On Investigating the Conservative Property of Score-Based Generative Models. (arXiv:2209.12753v2 [cs.LG] UPDATED)
    Existing Score-based Generative Models (SGMs) can be categorized into constrained SGMs (CSGMs) or unconstrained SGMs (USGMs) according to their parameterization approaches. CSGMs model probability density functions as Boltzmann distributions, and assign their predictions as the negative gradients of some scalar-valued energy functions. On the other hand, USGMs employ flexible architectures capable of directly estimating scores without the need to explicitly model energy functions. In this paper, we demonstrate that the architectural constraints of CSGMs may limit their modeling ability. In addition, we show that USGMs' inability to preserve the property of conservativeness may lead to degraded sampling performance in practice. To address the above issues, we propose Quasi-Conservative Score-based Generative Models (QCSGMs) for keeping the advantages of both CSGMs and USGMs. Our theoretical derivations demonstrate that the training objective of QCSGMs can be efficiently integrated into the training processes by leveraging the Hutchinson trace estimator. In addition, our experimental results on the CIFAR-10, CIFAR-100, ImageNet, and SVHN datasets validate the effectiveness of QCSGMs. Finally, we justify the advantage of QCSGMs using an example of a one-layered autoencoder.
    Learning from Long-Tailed Noisy Data with Sample Selection and Balanced Loss. (arXiv:2211.10906v2 [cs.LG] UPDATED)
    The success of deep learning depends on large-scale and well-curated training data, while data in real-world applications are commonly long-tailed and noisy. Many methods have been proposed to deal with long-tailed data or noisy data, while a few methods are developed to tackle long-tailed noisy data. To solve this, we propose a robust method for learning from long-tailed noisy data with sample selection and balanced loss. Specifically, we separate the noisy training data into clean labeled set and unlabeled set with sample selection, and train the deep neural network in a semi-supervised manner with a balanced loss based on model bias. Extensive experiments on benchmarks demonstrate that our method outperforms existing state-of-the-art methods.
    FairGBM: Gradient Boosting with Fairness Constraints. (arXiv:2209.07850v3 [cs.LG] UPDATED)
    Tabular data is prevalent in many high stakes domains, such as financial services or public policy. Gradient boosted decision trees (GBDT) are popular in these settings due to performance guarantees and low cost. However, in consequential decision-making fairness is a foremost concern. Despite GBDT's popularity, existing in-processing Fair ML methods are either inapplicable to GBDT, or incur in significant train time overhead, or are inadequate for problems with high class imbalance -- a typical issue in these domains. We present FairGBM, a dual ascent learning framework for training GBDT under fairness constraints, with little to no impact on predictive performance when compared to unconstrained GBDT. Since observational fairness metrics are non-differentiable, we have to employ a "proxy-Lagrangian" formulation using smooth convex error rate proxies to enable gradient-based optimization. Our implementation shows an order of magnitude speedup in training time when compared with related work, a pivotal aspect to foster the widespread adoption of FairGBM by real-world practitioners.
    STaSy: Score-based Tabular data Synthesis. (arXiv:2210.04018v2 [cs.LG] UPDATED)
    Tabular data synthesis is a long-standing research topic in machine learning. Many different methods have been proposed over the past decades, ranging from statistical methods to deep generative methods. However, it has not always been successful due to the complicated nature of real-world tabular data. In this paper, we present a new model named Score-based Tabular data Synthesis (STaSy) and its training strategy based on the paradigm of score-based generative modeling. Despite the fact that score-based generative models have resolved many issues in generative models, there still exists room for improvement in tabular data synthesis. Our proposed training strategy includes a self-paced learning technique and a fine-tuning strategy, which further increases the sampling quality and diversity by stabilizing the denoising score matching training. Furthermore, we also conduct rigorous experimental studies in terms of the generative task trilemma: sampling quality, diversity, and time. In our experiments with 15 benchmark tabular datasets and 7 baselines, our method outperforms existing methods in terms of task-dependant evaluations and diversity.
    Evaluating Synthetically Generated Data from Small Sample Sizes: An Experimental Study. (arXiv:2211.10760v3 [cs.LG] UPDATED)
    In this paper, we propose a method for measuring the similarity low sample tabular data with synthetically generated data with a larger number of samples than original. This process is also known as data augmentation. But significance levels obtained from non-parametric tests are suspect when sample size is small. Our method uses a combination of geometry, topology and robust statistics for hypothesis testing in order to compare the validity of generated data. We also compare the results with common global metric methods available in the literature for large sample size data.
    On the power of foundation models. (arXiv:2211.16327v2 [cs.AI] UPDATED)
    With infinitely many high-quality data points, infinite computational power, an infinitely large foundation model with a perfect training algorithm and guaranteed zero generalization error on the pretext task, can the model be used for everything? This question cannot be answered by the existing theory of representation, optimization or generalization, because the issues they mainly investigate are assumed to be nonexistent here. In this paper, we show that category theory provides powerful machinery to answer this question. We have proved three results. The first one limits the power of prompt-based learning, saying that the model can solve a downstream task with prompts if and only if the task is representable. The second one says fine tuning does not have this limit, as a foundation model with the minimum required power (up to symmetry) can theoretically solve downstream tasks with fine tuning and enough resources. Our final result can be seen as a new type of generalization theorem, showing that the foundation model can generate unseen objects from the target category (e.g., images) using the structural information from the source category (e.g., texts). Along the way, we provide a categorical framework for supervised and self-supervised learning, which might be of independent interest.
    SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models. (arXiv:2210.05861v2 [cs.CV] UPDATED)
    Understanding dynamics from visual observations is a challenging problem that requires disentangling individual objects from the scene and learning their interactions. While recent object-centric models can successfully decompose a scene into objects, modeling their dynamics effectively still remains a challenge. We address this problem by introducing SlotFormer -- a Transformer-based autoregressive model operating on learned object-centric representations. Given a video clip, our approach reasons over object features to model spatio-temporal relationships and predicts accurate future object states. In this paper, we successfully apply SlotFormer to perform video prediction on datasets with complex object interactions. Moreover, the unsupervised SlotFormer's dynamics model can be used to improve the performance on supervised downstream tasks, such as Visual Question Answering (VQA), and goal-conditioned planning. Compared to past works on dynamics modeling, our method achieves significantly better long-term synthesis of object dynamics, while retaining high quality visual generation. Besides, SlotFormer enables VQA models to reason about the future without object-level labels, even outperforming counterparts that use ground-truth annotations. Finally, we show its ability to serve as a world model for model-based planning, which is competitive with methods designed specifically for such tasks.
    Shortest Path Networks for Graph Property Prediction. (arXiv:2206.01003v4 [cs.LG] UPDATED)
    Most graph neural network models rely on a particular message passing paradigm, where the idea is to iteratively propagate node representations of a graph to each node in the direct neighborhood. While very prominent, this paradigm leads to information propagation bottlenecks, as information is repeatedly compressed at intermediary node representations, which causes loss of information, making it practically impossible to gather meaningful signals from distant nodes. To address this, we propose shortest path message passing neural networks, where the node representations of a graph are propagated to each node in the shortest path neighborhoods. In this setting, nodes can directly communicate between each other even if they are not neighbors, breaking the information bottleneck and hence leading to more adequately learned representations. Our framework generalizes message passing neural networks, resulting in a class of more expressive models, including some recent state-of-the-art models. We verify the capacity of a basic model of this framework on dedicated synthetic experiments, and on real-world graph classification and regression benchmarks, and obtain state-of-the art results.
    Optimism in Face of a Context: Regret Guarantees for Stochastic Contextual MDP. (arXiv:2207.11126v2 [cs.LG] UPDATED)
    We present regret minimization algorithms for stochastic contextual MDPs under minimum reachability assumption, using an access to an offline least square regression oracle. We analyze three different settings: where the dynamics is known, where the dynamics is unknown but independent of the context and the most challenging setting where the dynamics is unknown and context-dependent. For the latter, our algorithm obtains regret bound of $\widetilde{O}( (H+{1}/{p_{min}})H|S|^{3/2}\sqrt{|A|T\log(\max\{|\mathcal{G}|,|\mathcal{P}|\}/\delta)})$ with probability $1-\delta$, where $\mathcal{P}$ and $\mathcal{G}$ are finite and realizable function classes used to approximate the dynamics and rewards respectively, $p_{min}$ is the minimum reachability parameter, $S$ is the set of states, $A$ the set of actions, $H$ the horizon, and $T$ the number of episodes. To our knowledge, our approach is the first optimistic approach applied to contextual MDPs with general function approximation (i.e., without additional knowledge regarding the function class, such as it being linear and etc.). We present a lower bound of $\Omega(\sqrt{T H |S| |A| \ln(|\mathcal{G}|)/\ln(|A|)})$, on the expected regret which holds even in the case of known dynamics. Lastly, we discuss an extension of our results to CMDPs without minimum reachability, that obtains $\widetilde{O}(T^{3/4})$ regret.
    DIVISION: Memory Efficient Training via Dual Activation Precision. (arXiv:2208.04187v3 [cs.LG] UPDATED)
    Existing work of activation compressed training relies on searching for optimal bit-width during DNN training to reduce the quantization noise, which makes the procedure complicated and less transparent. To this end, we propose a simple and effective method to compress DNN training. Our method is motivated by an instructive observation: DNN backward propagation mainly utilizes the low-frequency component (LFC) of the activation maps, while the majority of memory is for caching the high-frequency component (HFC) during the training. This indicates the HFC of activation maps is highly redundant and compressible during DNN training, which inspires our proposed Dual Activation Precision (DIVISION). During the training, DIVISION preserves the high-precision copy of LFC and compresses the HFC into a light-weight copy with low numerical precision. This can significantly reduce the memory cost without negatively affecting the precision of backward propagation such that DIVISION maintains competitive model accuracy. Experimental results show DIVISION achieves over 10x compression of activation maps, and significantly higher training throughput than state-of-the-art ACT methods, without loss of model accuracy.
    Explainable Image Quality Assessments in Teledermatological Photography. (arXiv:2209.04699v2 [cs.CV] UPDATED)
    Image quality is a crucial factor in the effectiveness and efficiency of teledermatological consultations. However, up to 50% of images sent by patients have quality issues, thus increasing the time to diagnosis and treatment. An automated, easily deployable, explainable method for assessing image quality is necessary to improve the current teledermatological consultation flow. We introduce ImageQX, a convolutional neural network for image quality assessment with a learning mechanism for identifying the most common poor image quality explanations: bad framing, bad lighting, blur, low resolution, and distance issues. ImageQX was trained on 26,635 photographs and validated on 9,874 photographs, each annotated with image quality labels and poor image quality explanations by up to 12 board-certified dermatologists. The photographic images were taken between 2017 and 2019 using a mobile skin disease tracking application accessible worldwide. Our method achieves expert-level performance for both image quality assessment and poor image quality explanation. For image quality assessment, ImageQX obtains a macro F1-score of 0.73 +- 0.01, which places it within standard deviation of the pairwise inter-rater F1-score of 0.77 +- 0.07. For poor image quality explanations, our method obtains F1-scores of between 0.37 +- 0.01 and 0.70 +- 0.01, similar to the inter-rater pairwise F1-score of between 0.24 +- 0.15 and 0.83 +- 0.06. Moreover, with a size of only 15 MB, ImageQX is easily deployable on mobile devices. With an image quality detection performance similar to that of dermatologists, incorporating ImageQX into the teledermatology flow can enable a better, faster flow for remote consultations.
    Pseudo-Hamiltonian Neural Networks with State-Dependent External Forces. (arXiv:2206.02660v4 [cs.LG] UPDATED)
    Hybrid machine learning based on Hamiltonian formulations has recently been successfully demonstrated for simple mechanical systems, both energy conserving and not energy conserving. We introduce a pseudo-Hamiltonian formulation that is a generalization of the Hamiltonian formulation via the port-Hamiltonian formulation, and show that pseudo-Hamiltonian neural network models can be used to learn external forces acting on a system. We argue that this property is particularly useful when the external forces are state dependent, in which case it is the pseudo-Hamiltonian structure that facilitates the separation of internal and external forces. Numerical results are provided for a forced and damped mass-spring system and a tank system of higher complexity, and a symmetric fourth-order integration scheme is introduced for improved training on sparse and noisy data.
    Overfitting in quantum machine learning and entangling dropout. (arXiv:2205.11446v2 [quant-ph] UPDATED)
    The ultimate goal in machine learning is to construct a model function that has a generalization capability for unseen dataset, based on given training dataset. If the model function has too much expressibility power, then it may overfit to the training data and as a result lose the generalization capability. To avoid such overfitting issue, several techniques have been developed in the classical machine learning regime, and the dropout is one such effective method. This paper proposes a straightforward analogue of this technique in the quantum machine learning regime, the entangling dropout, meaning that some entangling gates in a given parametrized quantum circuit are randomly removed during the training process to reduce the expressibility of the circuit. Some simple case studies are given to show that this technique actually suppresses the overfitting.
    On A Mallows-type Model For (Ranked) Choices. (arXiv:2207.01783v2 [cs.LG] UPDATED)
    We consider a preference learning setting where every participant chooses an ordered list of $k$ most preferred items among a displayed set of candidates. (The set can be different for every participant.) We identify a distance-based ranking model for the population's preferences and their (ranked) choice behavior. The ranking model resembles the Mallows model but uses a new distance function called Reverse Major Index (RMJ). We find that despite the need to sum over all permutations, the RMJ-based ranking distribution aggregates into (ranked) choice probabilities with simple closed-form expression. We develop effective methods to estimate the model parameters and showcase their generalization power using real data, especially when there is a limited variety of display sets.
    Predicting highway lane-changing maneuvers: A benchmark analysis of machine and ensemble learning algorithms. (arXiv:2204.10807v3 [cs.LG] UPDATED)
    Understanding and predicting highway lane-change maneuvers is essential for driving modeling and its automation. The development of data-based lane-changing decision-making algorithms is nowadays in full expansion. We compare empirically in this article different machine and ensemble learning classification techniques to the MOBIL rule-based model using trajectory data of European two-lane highways. The analysis relies on instantaneous measurements of up to twenty-four spatial-temporal variables with the four neighboring vehicles on current and adjacent lanes. Preliminary descriptive investigations by principal component and logistic analyses allow identifying main variables intending a driver to change lanes. We predict two types of discretionary lane-change maneuvers: overtaking (from the slow to the fast lane) and fold-down (from the fast to the slow lane). The prediction accuracy is quantified using total, lane-changing and lane-keeping errors and associated receiver operating characteristic curves. The benchmark analysis includes logistic model, linear discriminant, decision tree, na\"ive Bayes classifier, support vector machine, neural network machine learning algorithms, and up to ten bagging and stacking ensemble learning meta-heuristics. If the rule-based model provides limited predicting accuracy, the data-based algorithms, devoid of modeling bias, allow significant prediction improvements. Cross validations show that selected neural networks and stacking algorithms allow predicting from a single observation both fold-down and overtaking maneuvers up to four seconds in advance with high accuracy.
    EvenNet: Ignoring Odd-Hop Neighbors Improves Robustness of Graph Neural Networks. (arXiv:2205.13892v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have received extensive research attention for their promising performance in graph machine learning. Despite their extraordinary predictive accuracy, existing approaches, such as GCN and GPRGNN, are not robust in the face of homophily changes on test graphs, rendering these models vulnerable to graph structural attacks and with limited capacity in generalizing to graphs of varied homophily levels. Although many methods have been proposed to improve the robustness of GNN models, most of these techniques are restricted to the spatial domain and employ complicated defense mechanisms, such as learning new graph structures or calculating edge attentions. In this paper, we study the problem of designing simple and robust GNN models in the spectral domain. We propose EvenNet, a spectral GNN corresponding to an even-polynomial graph filter. Based on our theoretical analysis in both spatial and spectral domains, we demonstrate that EvenNet outperforms full-order models in generalizing across homophilic and heterophilic graphs, implying that ignoring odd-hop neighbors improves the robustness of GNNs. We conduct experiments on both synthetic and real-world datasets to demonstrate the effectiveness of EvenNet. Notably, EvenNet outperforms existing defense models against structural attacks without introducing additional computational costs and maintains competitiveness in traditional node classification tasks on homophilic and heterophilic graphs.
    Contrastive Learning for Unsupervised Domain Adaptation of Time Series. (arXiv:2206.06243v3 [cs.LG] UPDATED)
    Unsupervised domain adaptation (UDA) aims at learning a machine learning model using a labeled source domain that performs well on a similar yet different, unlabeled target domain. UDA is important in many applications such as medicine, where it is used to adapt risk scores across different patient cohorts. In this paper, we develop a novel framework for UDA of time series data, called CLUDA. Specifically, we propose a contrastive learning framework to learn contextual representations in multivariate time series, so that these preserve label information for the prediction task. In our framework, we further capture the variation in the contextual representations between source and target domain via a custom nearest-neighbor contrastive learning. To the best of our knowledge, ours is the first framework to learn domain-invariant, contextual representation for UDA of time series data. We evaluate our framework using a wide range of time series datasets to demonstrate its effectiveness and show that it achieves state-of-the-art performance for time series UDA.
    An Explainable-AI approach for Diagnosis of COVID-19 using MALDI-ToF Mass Spectrometry. (arXiv:2109.14099v2 [cs.LG] UPDATED)
    The severe acute respiratory syndrome coronavirus type-2 (SARS-CoV-2) caused a global pandemic and imposed immense effects on the global economy. Accurate, cost-effective, and quick tests have proven substantial in identifying infected people and mitigating the spread. Recently, multiple alternative platforms for testing coronavirus disease 2019 (COVID-19) have been published that show high agreement with current gold standard real-time polymerase chain reaction (RT-PCR) results. These new methods do away with nasopharyngeal (NP) swabs, eliminate the need for complicated reagents, and reduce the burden on RT-PCR test reagent supply. In the present work, we have designed an artificial intelligence-based (AI) testing method to provide confidence in the results. Current AI applications to COVID-19 studies often lack a biological foundation in the decision-making process, and our AI approach is one of the earliest to leverage explainable-AI (X-AI) algorithms for COVID-19 diagnosis using mass spectrometry. Here, we have employed X-AI to explain the decision-making process on a local (per-sample) and global (all samples) basis underscored by biologically relevant features. We evaluated our technique with data extracted from human gargle samples and achieved a testing accuracy of 94.44%. Such techniques would strengthen the relationship between AI and clinical diagnostics by providing biomedical researchers and healthcare workers with trustworthy and, most importantly, explainable test results.
    Dealing with Unknown Variances in Best-Arm Identification. (arXiv:2210.00974v2 [stat.ML] UPDATED)
    The problem of identifying the best arm among a collection of items having Gaussian rewards distribution is well understood when the variances are known. Despite its practical relevance for many applications, few works studied it for unknown variances. In this paper we introduce and analyze two approaches to deal with unknown variances, either by plugging in the empirical variance or by adapting the transportation costs. In order to calibrate our two stopping rules, we derive new time-uniform concentration inequalities, which are of independent interest. Then, we illustrate the theoretical and empirical performances of our two sampling rule wrappers on Track-and-Stop and on a Top Two algorithm. Moreover, by quantifying the impact on the sample complexity of not knowing the variances, we reveal that it is rather small.
    A Survey on Distributed Online Optimization and Game. (arXiv:2205.00473v2 [cs.LG] UPDATED)
    Distributed online optimization and game have been increasingly researched in the last decade, mostly motivated by its wide applications in sensor networks, robotics (e.g., distributed target tracking and formation control), smart grids, deep learning, and so forth. In these problems, there is a network of agents who may be cooperative (i.e., distributed online optimization) or noncooperative (i.e., online game) through local information exchanges. And the local cost function of each agent is often time-varying in dynamic and even adversarial environments. At each time, a decision must be made by each agent based on historical information at hand without knowing future information on cost functions. For these problems, a comprehensive survey is still lacking. This paper aims to provide a thorough overview of distributed online optimization and game from the perspective of problem settings, communication, computation, algorithms, and performances. In addition, some potential future directions are also discussed.
    Stochastic Second-Order Methods Improve Best-Known Sample Complexity of SGD for Gradient-Dominated Function. (arXiv:2205.12856v2 [cs.LG] UPDATED)
    We study the performance of Stochastic Cubic Regularized Newton (SCRN) on a class of functions satisfying gradient dominance property with $1\le\alpha\le2$ which holds in a wide range of applications in machine learning and signal processing. This condition ensures that any first-order stationary point is a global optimum. We prove that the total sample complexity of SCRN in achieving $\epsilon$-global optimum is $\mathcal{O}(\epsilon^{-7/(2\alpha)+1})$ for $1\le\alpha< 3/2$ and $\mathcal{\tilde{O}}(\epsilon^{-2/(\alpha)})$ for $3/2\le\alpha\le 2$. SCRN improves the best-known sample complexity of stochastic gradient descent. Even under a weak version of gradient dominance property, which is applicable to policy-based reinforcement learning (RL), SCRN achieves the same improvement over stochastic policy gradient methods. Additionally, we show that the average sample complexity of SCRN can be reduced to ${\mathcal{O}}(\epsilon^{-2})$ for $\alpha=1$ using a variance reduction method with time-varying batch sizes. Experimental results in various RL settings showcase the remarkable performance of SCRN compared to first-order methods.
    Sustaining Fairness via Incremental Learning. (arXiv:2208.12212v2 [cs.LG] UPDATED)
    Machine learning systems are often deployed for making critical decisions like credit lending, hiring, etc. While making decisions, such systems often encode the user's demographic information (like gender, age) in their intermediate representations. This can lead to decisions that are biased towards specific demographics. Prior work has focused on debiasing intermediate representations to ensure fair decisions. However, these approaches fail to remain fair with changes in the task or demographic distribution. To ensure fairness in the wild, it is important for a system to adapt to such changes as it accesses new data in an incremental fashion. In this work, we propose to address this issue by introducing the problem of learning fair representations in an incremental learning setting. To this end, we present Fairness-aware Incremental Representation Learning (FaIRL), a representation learning system that can sustain fairness while incrementally learning new tasks. FaIRL is able to achieve fairness and learn new tasks by controlling the rate-distortion function of the learned representations. Our empirical evaluations show that FaIRL is able to make fair decisions while achieving high performance on the target task, outperforming several baselines.
    Generating Diverse Teammates to Train Robust Agents For Ad Hoc Teamwork. (arXiv:2207.14138v2 [cs.LG] UPDATED)
    Ad hoc teamwork (AHT) is the challenge of designing a learner that effectively collaborates with unknown teammates without prior coordination mechanisms. Early approaches address the AHT challenge by training the learner with a diverse set of handcrafted teammate policies, usually designed based on an expert's domain knowledge about the policies the learner may encounter. However, implementing teammate policies for training based on domain knowledge is not always feasible. In such cases, recent approaches attempted to improve the robustness of the learner by training it with teammate policies generated by optimising information-theoretic diversity metrics. However, optimising information-theoretic diversity metrics may generate teammates with superficially different behaviours, which does not necessarily result in a robust learner that can effectively collaborate with unknown teammates. In this paper, we present an automated teammate policy generation method optimising the Best-Response Diversity (BRDiv) metric, which measures diversity based on the compatibility of teammate policies in terms of returns. We evaluate our approach in environments with multiple valid coordination strategies, comparing against methods optimising information-theoretic diversity metrics and an ablation not optimising any diversity metric. Our experiments indicate that optimising BRDiv yields a diverse set of training teammate policies that improve the learner's performance relative to previous teammate generation approaches when collaborating with near-optimal previously unseen teammate policies.
    Do Gradient Inversion Attacks Make Federated Learning Unsafe?. (arXiv:2202.06924v2 [cs.LG] UPDATED)
    Federated learning (FL) allows the collaborative training of AI models without needing to share raw data. This capability makes it especially interesting for healthcare applications where patient and data privacy is of utmost concern. However, recent works on the inversion of deep neural networks from model gradients raised concerns about the security of FL in preventing the leakage of training data. In this work, we show that these attacks presented in the literature are impractical in FL use-cases where the clients' training involves updating the Batch Normalization (BN) statistics and provide a new baseline attack that works for such scenarios. Furthermore, we present new ways to measure and visualize potential data leakage in FL. Our work is a step towards establishing reproducible methods of measuring data leakage in FL and could help determine the optimal tradeoffs between privacy-preserving techniques, such as differential privacy, and model accuracy based on quantifiable metrics. Code is available at https://nvidia.github.io/NVFlare/research/quantifying-data-leakage.
    GANs and Closures: Micro-Macro Consistency in Multiscale Modeling. (arXiv:2208.10715v2 [cs.LG] UPDATED)
    Sampling the phase space of molecular systems -- and, more generally, of complex systems effectively modeled by stochastic differential equations -- is a crucial modeling step in many fields, from protein folding to materials discovery. These problems are often multiscale in nature: they can be described in terms of low-dimensional effective free energy surfaces parametrized by a small number of "slow" reaction coordinates; the remaining "fast" degrees of freedom populate an equilibrium measure on the reaction coordinate values. Sampling procedures for such problems are used to estimate effective free energy differences as well as ensemble averages with respect to the conditional equilibrium distributions; these latter averages lead to closures for effective reduced dynamic models. Over the years, enhanced sampling techniques coupled with molecular simulation have been developed. An intriguing analogy arises with the field of Machine Learning (ML), where Generative Adversarial Networks can produce high dimensional samples from low dimensional probability distributions. This sample generation returns plausible high dimensional space realizations of a model state, from information about its low-dimensional representation. In this work, we present an approach that couples physics-based simulations and biasing methods for sampling conditional distributions with ML-based conditional generative adversarial networks for the same task. The "coarse descriptors" on which we condition the fine scale realizations can either be known a priori, or learned through nonlinear dimensionality reduction. We suggest that this may bring out the best features of both approaches: we demonstrate that a framework that couples cGANs with physics-based enhanced sampling techniques can improve multiscale SDE dynamical systems sampling, and even shows promise for systems of increasing complexity.
    Critic Sequential Monte Carlo. (arXiv:2205.15460v2 [stat.ML] UPDATED)
    We introduce CriticSMC, a new algorithm for planning as inference built from a composition of sequential Monte Carlo with learned Soft-Q function heuristic factors. These heuristic factors, obtained from parametric approximations of the marginal likelihood ahead, more effectively guide SMC towards the desired target distribution, which is particularly helpful for planning in environments with hard constraints placed sparsely in time. Compared with previous work, we modify the placement of such heuristic factors, which allows us to cheaply propose and evaluate large numbers of putative action particles, greatly increasing inference and planning efficiency. CriticSMC is compatible with informative priors, whose density function need not be known, and can be used as a model-free control algorithm. Our experiments on collision avoidance in a high-dimensional simulated driving task show that CriticSMC significantly reduces collision rates at a low computational cost while maintaining realism and diversity of driving behaviors across vehicles and environment scenarios.
    Prediction Errors for Penalized Regressions based on Generalized Approximate Message Passing. (arXiv:2206.12832v3 [stat.ML] UPDATED)
    We discuss the prediction accuracy of assumed statistical models in terms of prediction errors for the generalized linear model and penalized maximum likelihood methods. We derive the forms of estimators for the prediction errors, such as $C_p$ criterion, information criteria, and leave-one-out cross validation (LOOCV) error, using the generalized approximate message passing (GAMP) algorithm and replica method. These estimators coincide with each other when the number of model parameters is sufficiently small; however, there is a discrepancy between them in particular in the parameter region where the number of model parameters is larger than the data dimension. In this paper, we review the prediction errors and corresponding estimators, and discuss their differences. In the framework of GAMP, we show that the information criteria can be expressed by using the variance of the estimates. Further, we demonstrate how to approach LOOCV error from the information criteria by utilizing the expression provided by GAMP.
    Analyzing Data-Centric Properties for Graph Contrastive Learning. (arXiv:2208.02810v3 [cs.LG] UPDATED)
    Recent analyses of self-supervised learning (SSL) find the following data-centric properties to be critical for learning good representations: invariance to task-irrelevant semantics, separability of classes in some latent space, and recoverability of labels from augmented samples. However, given their discrete, non-Euclidean nature, graph datasets and graph SSL methods are unlikely to satisfy these properties. This raises the question: how do graph SSL methods, such as contrastive learning (CL), work well? To systematically probe this question, we perform a generalization analysis for CL when using generic graph augmentations (GGAs), with a focus on data-centric properties. Our analysis yields formal insights into the limitations of GGAs and the necessity of task-relevant augmentations. As we empirically show, GGAs do not induce task-relevant invariances on common benchmark datasets, leading to only marginal gains over naive, untrained baselines. Our theory motivates a synthetic data generation process that enables control over task-relevant information and boasts pre-defined optimal augmentations. This flexible benchmark helps us identify yet unrecognized limitations in advanced augmentation techniques (e.g., automated methods). Overall, our work rigorously contextualizes, both empirically and theoretically, the effects of data-centric properties on augmentation strategies and learning paradigms for graph SSL.
    Particle algorithms for maximum likelihood training of latent variable models. (arXiv:2204.12965v4 [stat.CO] UPDATED)
    (Neal and Hinton, 1998) recast maximum likelihood estimation of any given latent variable model as the minimization of a free energy functional $F$, and the EM algorithm as coordinate descent applied to $F$. Here, we explore alternative ways to optimize the functional. In particular, we identify various gradient flows associated with $F$ and show that their limits coincide with $F$'s stationary points. By discretizing the flows, we obtain practical particle-based algorithms for maximum likelihood estimation in broad classes of latent variable models. The novel algorithms scale to high-dimensional settings and perform well in numerical experiments.
    Stability of Image-Reconstruction Algorithms. (arXiv:2206.07128v3 [math.OC] UPDATED)
    Robustness and stability of image-reconstruction algorithms have recently come under scrutiny. Their importance to medical imaging cannot be overstated. We review the known results for the topical variational regularization strategies ($\ell_2$ and $\ell_1$ regularization) and present novel stability results for $\ell_p$-regularized linear inverse problems for $p\in(1,\infty)$. Our results guarantee Lipschitz continuity for small $p$ and H\"{o}lder continuity for larger $p$. They generalize well to the $L_p(\Omega)$ function spaces.
    Tailoring to the Tails: Risk Measures for Fine-Grained Tail Sensitivity. (arXiv:2208.03066v2 [cs.LG] UPDATED)
    Expected risk minimization (ERM) is at the core of many machine learning systems. This means that the risk inherent in a loss distribution is summarized using a single number - its average. In this paper, we propose a general approach to construct risk measures which exhibit a desired tail sensitivity and may replace the expectation operator in ERM. Our method relies on the specification of a reference distribution with a desired tail behaviour, which is in a one-to-one correspondence to a coherent upper probability. Any risk measure, which is compatible with this upper probability, displays a tail sensitivity which is finely tuned to the reference distribution. As a concrete example, we focus on divergence risk measures based on f-divergence ambiguity sets, which are a widespread tool used to foster distributional robustness of machine learning systems. For instance, we show how ambiguity sets based on the Kullback-Leibler divergence are intricately tied to the class of subexponential random variables. We elaborate the connection of divergence risk measures and rearrangement invariant Banach norms.
    Predictive Model for Gross Community Production Rate of Coral Reefs using Ensemble Learning Methodologies. (arXiv:2111.04003v2 [cs.LG] UPDATED)
    Coral reefs play a vital role in maintaining the ecological balance of the marine ecosystem. Various marine organisms depend on coral reefs for their existence and their natural processes. Coral reefs provide the necessary habitat for reproduction and growth for various exotic species of the marine ecosystem. In this article, we discuss the most important parameters which influence the lifecycle of coral and coral reefs such as ocean acidification, deoxygenation and other physical parameters such as flow rate and surface area. Ocean acidification depends on the amount of dissolved Carbon dioxide (CO2). This is due to the release of H+ ions upon the reaction of the dissolved CO2 gases with the calcium carbonate compounds in the ocean. Deoxygenation is another problem that leads to hypoxia which is characterized by a lesser amount of dissolved oxygen in water than the required amount for the existence of marine organisms. In this article, we highlight the importance of physical parameters such as flow rate which influence gas exchange, heat dissipation, bleaching sensitivity, nutrient supply, feeding, waste and sediment removal, growth and reproduction. In this paper, we also bring out these important parameters and propose an ensemble machine learning-based model for analyzing these parameters and provide better rates that can help us to understand and suitably improve the ocean composition which in turn can eminently improve the sustainability of the marine ecosystem, mainly the coral reefs
    Autoencoding Hyperbolic Representation for Adversarial Generation. (arXiv:2201.12825v3 [cs.LG] UPDATED)
    With the recent advance of geometric deep learning, neural networks have been extensively used for data in non-Euclidean domains. In particular, hyperbolic neural networks have proved successful in processing hierarchical information of data. However, many hyperbolic neural networks are numerically unstable during training, which precludes using complex architectures. This crucial problem makes it difficult to build hyperbolic generative models for real and complex data. In this work, we propose a hyperbolic generative network in which we design novel architecture and layers to improve stability in training. Our proposed network contains three parts: first, a hyperbolic autoencoder (AE) that produces hyperbolic embedding for input data; second, a hyperbolic generative adversarial network (GAN) for generating the hyperbolic latent embedding of the AE from simple noise; third, a generator that inherits the decoder from the AE and the generator from the GAN. We call this network the hyperbolic AE-GAN, or HAEGAN for short. The architecture of HAEGAN fosters expressive representation in the hyperbolic space, and the specific design of layers ensures numerical stability. Experiments show that HAEGAN is able to generate complex data with state-of-the-art structure-related performance.
    Adversarial Examples in Random Neural Networks with General Activations. (arXiv:2203.17209v2 [cs.LG] UPDATED)
    A substantial body of empirical work documents the lack of robustness in deep learning models to adversarial examples. Recent theoretical work proved that adversarial examples are ubiquitous in two-layers networks with sub-exponential width and ReLU or smooth activations, and multi-layer ReLU networks with sub-exponential width. We present a result of the same type, with no restriction on width and for general locally Lipschitz continuous activations. More precisely, given a neural network $f(\,\cdot\,;{\boldsymbol \theta})$ with random weights ${\boldsymbol \theta}$, and feature vector ${\boldsymbol x}$, we show that an adversarial example ${\boldsymbol x}'$ can be found with high probability along the direction of the gradient $\nabla_{{\boldsymbol x}}f({\boldsymbol x};{\boldsymbol \theta})$. Our proof is based on a Gaussian conditioning technique. Instead of proving that $f$ is approximately linear in a neighborhood of ${\boldsymbol x}$, we characterize the joint distribution of $f({\boldsymbol x};{\boldsymbol \theta})$ and $f({\boldsymbol x}';{\boldsymbol \theta})$ for ${\boldsymbol x}' = {\boldsymbol x}-s({\boldsymbol x})\nabla_{{\boldsymbol x}}f({\boldsymbol x};{\boldsymbol \theta})$.
    MetaQA: Combining Expert Agents for Multi-Skill Question Answering. (arXiv:2112.01922v3 [cs.CL] UPDATED)
    The recent explosion of question answering (QA) datasets and models has increased the interest in the generalization of models across multiple domains and formats by either training on multiple datasets or by combining multiple models. Despite the promising results of multi-dataset models, some domains or QA formats may require specific architectures, and thus the adaptability of these models might be limited. In addition, current approaches for combining models disregard cues such as question-answer compatibility. In this work, we propose to combine expert agents with a novel, flexible, and training-efficient architecture that considers questions, answer predictions, and answer-prediction confidence scores to select the best answer among a list of answer candidates. Through quantitative and qualitative experiments we show that our model i) creates a collaboration between agents that outperforms previous multi-agent and multi-dataset approaches in both in-domain and out-of-domain scenarios, ii) is highly data-efficient to train, and iii) can be adapted to any QA format. We release our code and a dataset of answer predictions from expert agents for 16 QA datasets to foster future developments of multi-agent systems on https://github.com/UKPLab/MetaQA.
    Discriminative Multimodal Learning via Conditional Priors in Generative Models. (arXiv:2110.04616v3 [cs.LG] UPDATED)
    Deep generative models with latent variables have been used lately to learn joint representations and generative processes from multi-modal data. These two learning mechanisms can, however, conflict with each other and representations can fail to embed information on the data modalities. This research studies the realistic scenario in which all modalities and class labels are available for model training, but where some modalities and labels required for downstream tasks are missing. We show, in this scenario, that the variational lower bound limits mutual information between joint representations and missing modalities. We, to counteract these problems, introduce a novel conditional multi-modal discriminative model that uses an informative prior distribution and optimizes a likelihood-free objective function that maximizes mutual information between joint representations and missing modalities. Extensive experimentation demonstrates the benefits of our proposed model, empirical results show that our model achieves state-of-the-art results in representative problems such as downstream classification, acoustic inversion, and image and annotation generation.
    Augmenting Pre-trained Language Models with QA-Memory for Open-Domain Question Answering. (arXiv:2204.04581v3 [cs.CL] UPDATED)
    Retrieval augmented language models have recently become the standard for knowledge intensive tasks. Rather than relying purely on latent semantics within the parameters of large neural models, these methods enlist a semi-parametric memory to encode an index of knowledge for the model to retrieve over. Most prior work has employed text passages as the unit of knowledge, which has high coverage at the cost of interpretability, controllability, and efficiency. The opposite properties arise in other methods which have instead relied on knowledge base (KB) facts. At the same time, more recent work has demonstrated the effectiveness of storing and retrieving from an index of Q-A pairs derived from text \citep{lewis2021paq}. This approach yields a high coverage knowledge representation that maintains KB-like properties due to its representations being more atomic units of information. In this work we push this line of research further by proposing a question-answer augmented encoder-decoder model and accompanying pretraining strategy. This yields an end-to-end system that not only outperforms prior QA retrieval methods on single-hop QA tasks but also enables compositional reasoning, as demonstrated by strong performance on two multi-hop QA datasets. Together, these methods improve the ability to interpret and control the model while narrowing the performance gap with passage retrieval systems.
    Learning Regionally Decentralized AC Optimal Power Flows with ADMM. (arXiv:2205.03787v3 [eess.SY] UPDATED)
    One potential future for the next generation of smart grids is the use of decentralized optimization algorithms and secured communications for coordinating renewable generation (e.g., wind/solar), dispatchable devices (e.g., coal/gas/nuclear generations), demand response, battery & storage facilities, and topology optimization. The Alternating Direction Method of Multipliers (ADMM) has been widely used in the community to address such decentralized optimization problems and, in particular, the AC Optimal Power Flow (AC-OPF). This paper studies how machine learning may help in speeding up the convergence of ADMM for solving AC-OPF. It proposes a novel decentralized machine-learning approach, namely ML-ADMM, where each agent uses deep learning to learn the consensus parameters on the coupling branches. The paper also explores the idea of learning only from ADMM runs that exhibit high-quality convergence properties, and proposes filtering mechanisms to select these runs. Experimental results on test cases based on the French system demonstrate the potential of the approach in speeding up the convergence of ADMM significantly.
    Short Blocklength Wiretap Channel Codes via Deep Learning: Design and Performance Evaluation. (arXiv:2206.03477v2 [cs.IT] UPDATED)
    We design short blocklength codes for the Gaussian wiretap channel under information-theoretic security guarantees. Our approach consists in decoupling the reliability and secrecy constraints in our code design. Specifically, we handle the reliability constraint via an autoencoder, and handle the secrecy constraint with hash functions. For blocklengths smaller than or equal to 128, we evaluate through simulations the probability of error at the legitimate receiver and the leakage at the eavesdropper for our code construction. This leakage is defined as the mutual information between the confidential message and the eavesdropper's channel observations, and is empirically measured via a neural network-based mutual information estimator. Our simulation results provide examples of codes with positive secrecy rates that outperform the best known achievable secrecy rates obtained non-constructively for the Gaussian wiretap channel. Additionally, we show that our code design is suitable for the compound and arbitrarily varying Gaussian wiretap channels, for which the channel statistics are not perfectly known but only known to belong to a pre-specified uncertainty set. These models not only capture uncertainty related to channel statistics estimation, but also scenarios where the eavesdropper jams the legitimate transmission or influences its own channel statistics by changing its location.
    Improving Spectral Clustering Using Spectrum-Preserving Node Aggregation. (arXiv:2110.12328v6 [cs.LG] UPDATED)
    Spectral clustering is one of the most popular clustering methods. However, the high computational cost due to the involved eigen-decomposition procedure can immediately hinder its applications in large-scale tasks. In this paper we use spectrum-preserving node reduction to accelerate eigen-decomposition and generate concise representations of data sets. Specifically, we create a small number of pseudonodes based on spectral similarity. Then, standard spectral clustering algorithm is performed on the smaller node set. Finally, each data point in the original data set is assigned to the cluster as its representative pseudo-node. The proposed framework run in nearly-linear time. Meanwhile, the clustering accuracy can be significantly improved by mining concise representations. The experimental results show dramatically improved clustering performance when compared with state-of-the-art methods.
    Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback. (arXiv:2201.13172v2 [cs.LG] UPDATED)
    The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observed in delay. This paper studies online learning in episodic Markov decision process (MDP) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode $k$ is revealed only in the end of episode $k + d^k$, where the delay $d^k$ can be changing over episodes and chosen by an oblivious adversary. We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^K d^k$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$.
    Explainable deep learning for insights in El Ni\~no and river flows. (arXiv:2201.02596v3 [physics.ao-ph] UPDATED)
    The El Ni\~no Southern Oscillation (ENSO) is a semi-periodic fluctuation in sea surface temperature (SST) over the tropical central and eastern Pacific Ocean that influences interannual variability in regional hydrology across the world through long-range dependence or teleconnections. Recent research has demonstrated the value of Deep Learning (DL) methods for improving ENSO prediction as well as Complex Networks (CN) for understanding teleconnections. However, gaps in predictive understanding of ENSO-driven river flows include the black box nature of DL, the use of simple ENSO indices to describe a complex phenomenon and translating DL-based ENSO predictions to river flow predictions. Here we show that eXplainable DL (XDL) methods, based on saliency maps, can extract interpretable predictive information contained in global SST and discover SST information regions and dependence structures relevant for river flows which, in tandem with climate network constructions, enable improved predictive understanding. Our results reveal additional information content in global SST beyond ENSO indices, develop understanding of how SSTs influence river flows, and generate improved river flow prediction, including uncertainty estimation. Observations, reanalysis data, and earth system model simulations are used to demonstrate the value of the XDL-CN based methods for future interannual and decadal scale climate projections.
    A Context-Integrated Transformer-Based Neural Network for Auction Design. (arXiv:2201.12489v3 [cs.GT] UPDATED)
    One of the central problems in auction design is developing an incentive-compatible mechanism that maximizes the auctioneer's expected revenue. While theoretical approaches have encountered bottlenecks in multi-item auctions, recently, there has been much progress on finding the optimal mechanism through deep learning. However, these works either focus on a fixed set of bidders and items, or restrict the auction to be symmetric. In this work, we overcome such limitations by factoring \emph{public} contextual information of bidders and items into the auction learning framework. We propose $\mathtt{CITransNet}$, a context-integrated transformer-based neural network for optimal auction design, which maintains permutation-equivariance over bids and contexts while being able to find asymmetric solutions. We show by extensive experiments that $\mathtt{CITransNet}$ can recover the known optimal solutions in single-item settings, outperform strong baselines in multi-item auctions, and generalize well to cases other than those in training.
    Linear Connectivity Reveals Generalization Strategies. (arXiv:2205.12411v5 [cs.LG] UPDATED)
    It is widely accepted in the mode connectivity literature that when two neural networks are trained similarly on the same data, they are connected by a path through parameter space over which test set accuracy is maintained. Under some circumstances, including transfer learning from pretrained models, these paths are presumed to be linear. In contrast to existing results, we find that among text classifiers (trained on MNLI, QQP, and CoLA), some pairs of finetuned models have large barriers of increasing loss on the linear paths between them. On each task, we find distinct clusters of models which are linearly connected on the test loss surface, but are disconnected from models outside the cluster -- models that occupy separate basins on the surface. By measuring performance on specially-crafted diagnostic datasets, we find that these clusters correspond to different generalization strategies: one cluster behaves like a bag of words model under domain shift, while another cluster uses syntactic heuristics. Our work demonstrates how the geometry of the loss surface can guide models towards different heuristic functions.
    Convergence and Implicit Regularization Properties of Gradient Descent for Deep Residual Networks. (arXiv:2204.07261v3 [cs.LG] UPDATED)
    We prove linear convergence of gradient descent to a global optimum for the training of deep residual networks with constant layer width and smooth activation function. We show that if the trained weights, as a function of the layer index, admit a scaling limit as the depth increases, then the limit has finite $p-$variation with $p=2$. Proofs are based on non-asymptotic estimates for the loss function and for norms of the network weights along the gradient descent path. We illustrate the relevance of our theoretical results to practical settings using detailed numerical experiments on supervised learning problems.
    A Unification Framework for Euclidean and Hyperbolic Graph Neural Networks. (arXiv:2206.04285v2 [cs.LG] UPDATED)
    Hyperbolic neural networks are able to capture the inherent hierarchy of graph datasets, and consequently a powerful choice of GNNs. However, they entangle multiple incongruent (gyro-)vector spaces within a layer, which makes them limited in terms of generalization and scalability. In this work, we propose to use Poincar\'e disk model as our search space, and apply all approximations on the disk (as if the disk is a tangent space derived from the origin), and thus getting rid of all inter-space transformations. Such an approach enables us to propose a hyperbolic normalization layer, and to further simplify the entire hyperbolic model to a Euclidean model cascaded with our hyperbolic normalization layer. We applied our proposed nonlinear hyperbolic normalization to the current state-of-the-art homogeneous and multi-relational graph networks. We demonstrate that not only does the model leverage the power of Euclidean networks such as interpretability and efficient execution of various model components, but also it outperforms both Euclidean and hyperbolic counterparts in our benchmarks.
    From Kepler to Newton: Explainable AI for Science. (arXiv:2111.12210v7 [cs.AI] UPDATED)
    The Observation--Hypothesis--Prediction--Experimentation loop paradigm for scientific research has been practiced by researchers for years towards scientific discoveries. However, with data explosion in both mega-scale and milli-scale scientific research, it has been sometimes very difficult to manually analyze the data and propose new hypotheses to drive the cycle for scientific discovery. In this paper, we discuss the role of Explainable AI in scientific discovery process by demonstrating an Explainable AI-based paradigm for science discovery. The key is to use Explainable AI to help derive data or model interpretations, hypotheses, as well as scientific discoveries or insights. We show how computational and data-intensive methodology -- together with experimental and theoretical methodology -- can be seamlessly integrated for scientific research. To demonstrate the AI-based science discovery process, and to pay our respect to some of the greatest minds in human history, we show how Kepler's laws of planetary motion and Newton's law of universal gravitation can be rediscovered by (Explainable) AI based on Tycho Brahe's astronomical observation data, whose works were leading the scientific revolution in the 16-17th century. This work also highlights the important role of Explainable AI (as compared to Blackbox AI) in science discovery to help humans prevent or better prepare for the possible technological singularity that may happen in the future, since science is not only about the know how, but also the know why. Presentation of the work is available at https://slideslive.com/38986142/from-kepler-to-newton-explainable-ai-for-science-discovery.
    Explicit Regularization in Overparametrized Models via Noise Injection. (arXiv:2206.04613v3 [cs.LG] UPDATED)
    Injecting noise within gradient descent has several desirable features, such as smoothing and regularizing properties. In this paper, we investigate the effects of injecting noise before computing a gradient step. We demonstrate that small perturbations can induce explicit regularization for simple models based on the L1-norm, group L1-norms, or nuclear norms. However, when applied to overparametrized neural networks with large widths, we show that the same perturbations can cause variance explosion. To overcome this, we propose using independent layer-wise perturbations, which provably allow for explicit regularization without variance explosion. Our empirical results show that these small perturbations lead to improved generalization performance compared to vanilla gradient descent.
    Huber-Robust Confidence Sequences. (arXiv:2301.09573v1 [math.ST])
    Confidence sequences are confidence intervals that can be sequentially tracked, and are valid at arbitrary data-dependent stopping times. This paper presents confidence sequences for a univariate mean of an unknown distribution with a known upper bound on the p-th central moment (p > 1), but allowing for (at most) {\epsilon} fraction of arbitrary distribution corruption, as in Huber's contamination model. We do this by designing new robust exponential supermartingales, and show that the resulting confidence sequences attain the optimal width achieved in the nonsequential setting. Perhaps surprisingly, the constant margin between our sequential result and the lower bound is smaller than even fixed-time robust confidence intervals based on the trimmed mean, for example. Since confidence sequences are a common tool used within A/B/n testing and bandits, these results open the door to sequential experimentation that is robust to outliers and adversarial corruptions.
    A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. (arXiv:1809.02362v2 [math.NA] UPDATED)
    Artificial neural networks (ANNs) have very successfully been used in numerical simulations for a series of computational problems ranging from image classification/image recognition, speech recognition, time series analysis, game intelligence, and computational advertising to numerical approximations of partial differential equations (PDEs). Such numerical simulations suggest that ANNs have the capacity to very efficiently approximate high-dimensional functions and, especially, indicate that ANNs seem to admit the fundamental power to overcome the curse of dimensionality when approximating the high-dimensional functions appearing in the above named computational problems. There are a series of rigorous mathematical approximation results for ANNs in the scientific literature. Some of them prove convergence without convergence rates and some even rigorously establish convergence rates but there are only a few special cases where mathematical results can rigorously explain the empirical success of ANNs when approximating high-dimensional functions. The key contribution of this article is to disclose that ANNs can efficiently approximate high-dimensional functions in the case of numerical approximations of Black-Scholes PDEs. More precisely, this work reveals that the number of required parameters of an ANN to approximate the solution of the Black-Scholes PDE grows at most polynomially in both the reciprocal of the prescribed approximation accuracy $\varepsilon > 0$ and the PDE dimension $d \in \mathbb{N}$. We thereby prove, for the first time, that ANNs do indeed overcome the curse of dimensionality in the numerical approximation of Black-Scholes PDEs.
    SUPER-Net: Trustworthy Medical Image Segmentation with Uncertainty Propagation in Encoder-Decoder Networks. (arXiv:2111.05978v3 [eess.IV] UPDATED)
    Deep Learning (DL) holds great promise in reshaping the healthcare industry owing to its precision, efficiency, and objectivity. However, the brittleness of DL models to noisy and out-of-distribution inputs is ailing their deployment in the clinic. Most models produce point estimates without further information about model uncertainty or confidence. This paper introduces a new Bayesian DL framework for uncertainty quantification in segmentation neural networks: SUPER-Net: trustworthy medical image Segmentation with Uncertainty Propagation in Encoder-decodeR Networks. SUPER-Net analytically propagates, using Taylor series approximations, the first two moments (mean and covariance) of the posterior distribution of the model parameters across the nonlinear layers. In particular, SUPER-Net simultaneously learns the mean and covariance without expensive post-hoc Monte Carlo sampling or model ensembling. The output consists of two simultaneous maps: the segmented image and its pixelwise uncertainty map, which corresponds to the covariance matrix of the predictive distribution. We conduct an extensive evaluation of SUPER-Net on medical image segmentation of Magnetic Resonances Imaging and Computed Tomography scans under various noisy and adversarial conditions. Our experiments on multiple benchmark datasets demonstrate that SUPER-Net is more robust to noise and adversarial attacks than state-of-the-art segmentation models. Moreover, the uncertainty map of the proposed SUPER-Net associates low confidence (or equivalently high uncertainty) to patches in the test input images that are corrupted with noise, artifacts, or adversarial attacks. Perhaps more importantly, the model exhibits the ability of self-assessment of its segmentation decisions, notably when making erroneous predictions due to noise or adversarial examples.
    Toward Foundation Models for Earth Monitoring: Generalizable Deep Learning Models for Natural Hazard Segmentation. (arXiv:2301.09318v1 [cs.CV])
    Climate change results in an increased probability of extreme weather events that put societies and businesses at risk on a global scale. Therefore, near real-time mapping of natural hazards is an emerging priority for the support of natural disaster relief, risk management, and informing governmental policy decisions. Recent methods to achieve near real-time mapping increasingly leverage deep learning (DL). However, DL-based approaches are designed for one specific task in a single geographic region based on specific frequency bands of satellite data. Therefore, DL models used to map specific natural hazards struggle with their generalization to other types of natural hazards in unseen regions. In this work, we propose a methodology to significantly improve the generalizability of DL natural hazards mappers based on pre-training on a suitable pre-task. Without access to any data from the target domain, we demonstrate this improved generalizability across four U-Net architectures for the segmentation of unseen natural hazards. Importantly, our method is invariant to geographic differences and differences in the type of frequency bands of satellite data. By leveraging characteristics of unlabeled images from the target domain that are publicly available, our approach is able to further improve the generalization behavior without fine-tuning. Thereby, our approach supports the development of foundation models for earth monitoring with the objective of directly segmenting unseen natural hazards across novel geographic regions given different sources of satellite imagery.
    Synthesis of Compositional Animations from Textual Descriptions. (arXiv:2103.14675v6 [cs.CV] UPDATED)
    "How can we animate 3D-characters from a movie script or move robots by simply telling them what we would like them to do?" "How unstructured and complex can we make a sentence and still generate plausible movements from it?" These are questions that need to be answered in the long-run, as the field is still in its infancy. Inspired by these problems, we present a new technique for generating compositional actions, which handles complex input sentences. Our output is a 3D pose sequence depicting the actions in the input sentence. We propose a hierarchical two-stream sequential model to explore a finer joint-level mapping between natural language sentences and 3D pose sequences corresponding to the given motion. We learn two manifold representations of the motion -- one each for the upper body and the lower body movements. Our model can generate plausible pose sequences for short sentences describing single actions as well as long compositional sentences describing multiple sequential and superimposed actions. We evaluate our proposed model on the publicly available KIT Motion-Language Dataset containing 3D pose data with human-annotated sentences. Experimental results show that our model advances the state-of-the-art on text-based motion synthesis in objective evaluations by a margin of 50%. Qualitative evaluations based on a user study indicate that our synthesized motions are perceived to be the closest to the ground-truth motion captures for both short and compositional sentences.
    Rig Inversion by Training a Differentiable Rig Function. (arXiv:2301.09567v1 [cs.GR])
    Rig inversion is the problem of creating a method that can find the rig parameter vector that best approximates a given input mesh. In this paper we propose to solve this problem by first obtaining a differentiable rig function by training a multi layer perceptron to approximate the rig function. This differentiable rig function can then be used to train a deep learning model of rig inversion.
    Dataset Structural Index: Leveraging a machine's perspective towards visual data. (arXiv:2110.04070v3 [cs.CV] UPDATED)
    With advances in vision and perception architectures, we have realized that working with data is equally crucial, if not more, than the algorithms. Till today, we have trained machines based on our knowledge and perspective of the world. The entire concept of Dataset Structural Index(DSI) revolves around understanding a machine`s perspective of the dataset. With DSI, I show two meta values with which we can get more information over a visual dataset and use it to optimize data, create better architectures, and have an ability to guess which model would work best. These two values are the Variety contribution ratio and Similarity matrix. In the paper, I show many applications of DSI, one of which is how the same level of accuracy can be achieved with the same model architectures trained over less amount of data.
    Estimating average causal effects from patient trajectories. (arXiv:2203.01228v2 [stat.ML] UPDATED)
    In medical practice, treatments are selected based on the expected causal effects on patient outcomes. Here, the gold standard for estimating causal effects are randomized controlled trials; however, such trials are costly and sometimes even unethical. Instead, medical practice is increasingly interested in estimating causal effects among patient (sub)groups from electronic health records, that is, observational data. In this paper, we aim at estimating the average causal effect (ACE) from observational data (patient trajectories) that are collected over time. For this, we propose DeepACE: an end-to-end deep learning model. DeepACE leverages the iterative G-computation formula to adjust for the bias induced by time-varying confounders. Moreover, we develop a novel sequential targeting procedure which ensures that DeepACE has favorable theoretical properties, i.e., is doubly robust and asymptotically efficient. To the best of our knowledge, this is the first work that proposes an end-to-end deep learning model tailored for estimating time-varying ACEs. We compare DeepACE in an extensive number of experiments, confirming that it achieves state-of-the-art performance. We further provide a case study for patients suffering from low back pain to demonstrate that DeepACE generates important and meaningful findings for clinical practice. Our work enables practitioners to develop effective treatment recommendations based on population effects.
    WDC Products: A Multi-Dimensional Entity Matching Benchmark. (arXiv:2301.09521v1 [cs.LG])
    The difficulty of an entity matching task depends on a combination of multiple factors such as the amount of corner-case pairs, the fraction of entities in the test set that have not been seen during training, and the size of the development set. Current entity matching benchmarks usually represent single points in the space along such dimensions or they provide for the evaluation of matching methods along a single dimension, for instance the amount of training data. This paper presents WDC Products, an entity matching benchmark which provides for the systematic evaluation of matching systems along combinations of three dimensions while relying on real-word data. The three dimensions are (i) amount of corner-cases (ii) generalization to unseen entities, and (iii) development set size. Generalization to unseen entities is a dimension not covered by any of the existing benchmarks yet but is crucial for evaluating the robustness of entity matching systems. WDC Products is based on heterogeneous product data from thousands of e-shops which mark-up products offers using schema.org annotations. Instead of learning how to match entity pairs, entity matching can also be formulated as a multi-class classification task that requires the matcher to recognize individual entities. WDC Products is the first benchmark that provides a pair-wise and a multi-class formulation of the same tasks and thus allows to directly compare the two alternatives. We evaluate WDC Products using several state-of-the-art matching systems, including Ditto, HierGAT, and R-SupCon. The evaluation shows that all matching systems struggle with unseen entities to varying degrees. It also shows that some systems are more training data efficient than others.
    A Time Series Approach to Parkinson's Disease Classification from EEG. (arXiv:2301.09568v1 [q-bio.NC])
    Firstly, we present a novel representation for EEG data, a 7-variate series of band power coefficients, which enables the use of (previously inaccessible) time series classification methods. Specifically, we implement the multi-resolution representation-based time series classification method MrSQL. This is deployed on a challenging early-stage Parkinson's dataset that includes wakeful and sleep EEG. Initial results are promising with over 90% accuracy achieved on all EEG data types used. Secondly, we present a framework that enables high-importance data types and brain regions for classification to be identified. Using our framework, we find that, across different EEG data types, it is the Prefrontal brain region that has the most predictive power for the presence of Parkinson's Disease. This outperformance was statistically significant versus ten of the twelve other brain regions (not significant versus adjacent Left Frontal and Right Frontal regions). The Prefrontal region of the brain is important for higher-order cognitive processes and our results align with studies that have shown neural dysfunction in the prefrontal cortex in Parkinson's Disease.
    BayBFed: Bayesian Backdoor Defense for Federated Learning. (arXiv:2301.09508v1 [cs.LG])
    Federated learning (FL) allows participants to jointly train a machine learning model without sharing their private data with others. However, FL is vulnerable to poisoning attacks such as backdoor attacks. Consequently, a variety of defenses have recently been proposed, which have primarily utilized intermediary states of the global model (i.e., logits) or distance of the local models (i.e., L2-norm) from the global model to detect malicious backdoors. However, as these approaches directly operate on client updates, their effectiveness depends on factors such as clients' data distribution or the adversary's attack strategies. In this paper, we introduce a novel and more generic backdoor defense framework, called BayBFed, which proposes to utilize probability distributions over client updates to detect malicious updates in FL: it computes a probabilistic measure over the clients' updates to keep track of any adjustments made in the updates, and uses a novel detection algorithm that can leverage this probabilistic measure to efficiently detect and filter out malicious updates. Thus, it overcomes the shortcomings of previous approaches that arise due to the direct usage of client updates; as our probabilistic measure will include all aspects of the local client training strategies. BayBFed utilizes two Bayesian Non-Parametric extensions: (i) a Hierarchical Beta-Bernoulli process to draw a probabilistic measure given the clients' updates, and (ii) an adaptation of the Chinese Restaurant Process (CRP), referred by us as CRP-Jensen, which leverages this probabilistic measure to detect and filter out malicious updates. We extensively evaluate our defense approach on five benchmark datasets: CIFAR10, Reddit, IoT intrusion detection, MNIST, and FMNIST, and show that it can effectively detect and eliminate malicious updates in FL without deteriorating the benign performance of the global model.
    On the Convergence of the Gradient Descent Method with Stochastic Fixed-point Rounding Errors under the Polyak-Lojasiewicz Inequality. (arXiv:2301.09511v1 [stat.ML])
    When training neural networks with low-precision computation, rounding errors often cause stagnation or are detrimental to the convergence of the optimizers; in this paper we study the influence of rounding errors on the convergence of the gradient descent method for problems satisfying the Polyak-Lojasiewicz inequality. Within this context, we show that, in contrast, biased stochastic rounding errors may be beneficial since choosing a proper rounding strategy eliminates the vanishing gradient problem and forces the rounding bias in a descent direction. Furthermore, we obtain a bound on the convergence rate that is stricter than the one achieved by unbiased stochastic rounding. The theoretical analysis is validated by comparing the performances of various rounding strategies when optimizing several examples using low-precision fixed-point number formats.
    FInC Flow: Fast and Invertible $k \times k$ Convolutions for Normalizing Flows. (arXiv:2301.09266v1 [cs.CV])
    Invertible convolutions have been an essential element for building expressive normalizing flow-based generative models since their introduction in Glow. Several attempts have been made to design invertible $k \times k$ convolutions that are efficient in training and sampling passes. Though these attempts have improved the expressivity and sampling efficiency, they severely lagged behind Glow which used only $1 \times 1$ convolutions in terms of sampling time. Also, many of the approaches mask a large number of parameters of the underlying convolution, resulting in lower expressivity on a fixed run-time budget. We propose a $k \times k$ convolutional layer and Deep Normalizing Flow architecture which i.) has a fast parallel inversion algorithm with running time O$(n k^2)$ ($n$ is height and width of the input image and k is kernel size), ii.) masks the minimal amount of learnable parameters in a layer. iii.) gives better forward pass and sampling times comparable to other $k \times k$ convolution-based models on real-world benchmarks. We provide an implementation of the proposed parallel algorithm for sampling using our invertible convolutions on GPUs. Benchmarks on CIFAR-10, ImageNet, and CelebA datasets show comparable performance to previous works regarding bits per dimension while significantly improving the sampling time.
    Estimating the energy requirements for long term memory formation. (arXiv:2301.09565v1 [q-bio.NC])
    Brains consume metabolic energy to process information, but also to store memories. The energy required for memory formation can be substantial, for instance in fruit flies memory formation leads to a shorter lifespan upon subsequent starvation (Mery and Kawecki, 2005). Here we estimate that the energy required corresponds to about 10mJ/bit and compare this to biophysical estimates as well as energy requirements in computer hardware. We conclude that biological memory storage is expensive, but the reason behind it is not known.
    Time-Conditioned Generative Modeling of Object-Centric Representations for Video Decomposition and Prediction. (arXiv:2301.08951v1 [cs.CV])
    When perceiving the world from multiple viewpoints, humans have the ability to reason about the complete objects in a compositional manner even when the object is completely occluded from partial viewpoints. Meanwhile, humans can imagine the novel views after observing multiple viewpoints. The remarkable recent advance in multi-view object-centric learning leaves some problems: 1) the partially or completely occluded shape of objects can not be well reconstructed. 2) the novel viewpoint prediction depends on expensive viewpoint annotations rather than implicit view rules. This makes the agent fail to perform like humans. In this paper, we introduce a time-conditioned generative model for videos. To reconstruct the complete shape of the object accurately, we enhance the disentanglement between different latent representations: view latent representations are jointly inferred based on the Transformer and then cooperate with the sequential extension of Slot Attention to learn object-centric representations. The model also achieves the new ability: Gaussian processes are employed as priors of view latent variables for generation and novel-view prediction without viewpoint annotations. Experiments on multiple specifically designed synthetic datasets have shown that the proposed model can 1) make the video decomposition, 2) reconstruct the complete shapes of objects, and 3) make the novel viewpoint prediction without viewpoint annotations.
    Federated Sufficient Dimension Reduction Through High-Dimensional Sparse Sliced Inverse Regression. (arXiv:2301.09500v1 [stat.ML])
    Federated learning has become a popular tool in the big data era nowadays. It trains a centralized model based on data from different clients while keeping data decentralized. In this paper, we propose a federated sparse sliced inverse regression algorithm for the first time. Our method can simultaneously estimate the central dimension reduction subspace and perform variable selection in a federated setting. We transform this federated high-dimensional sparse sliced inverse regression problem into a convex optimization problem by constructing the covariance matrix safely and losslessly. We then use a linearized alternating direction method of multipliers algorithm to estimate the central subspace. We also give approaches of Bayesian information criterion and hold-out validation to ascertain the dimension of the central subspace and the hyper-parameter of the algorithm. We establish an upper bound of the statistical error rate of our estimator under the heterogeneous setting. We demonstrate the effectiveness of our method through simulations and real world applications.
    The Entoptic Field Camera as Metaphor-Driven Research-through-Design with AI Technologies. (arXiv:2301.09545v1 [cs.HC])
    Artificial intelligence (AI) technologies are widely deployed in smartphone photography; and prompt-based image synthesis models have rapidly become commonplace. In this paper, we describe a Research-through-Design (RtD) project which explores this shift in the means and modes of image production via the creation and use of the Entoptic Field Camera. Entoptic phenomena usually refer to perceptions of floaters or bright blue dots stemming from the physiological interplay of the eye and brain. We use the term entoptic as a metaphor to investigate how the material interplay of data and models in AI technologies shapes human experiences of reality. Through our case study using first-person design and a field study, we offer implications for critical, reflective, more-than-human and ludic design to engage AI technologies; the conceptualisation of an RtD research space which contributes to AI literacy discourses; and outline a research trajectory concerning materiality and design affordances of AI technologies.
    Deep Learning Meets Sparse Regularization: A Signal Processing Perspective. (arXiv:2301.09554v1 [stat.ML])
    Deep learning has been widely successful in practice and most state-of-the-art machine learning methods are based on neural networks. Lacking, however, is a rigorous mathematical theory that adequately explains the amazing performance of deep neural networks. In this article, we present a relatively new mathematical framework that provides the beginning of a deeper understanding of deep learning. This framework precisely characterizes the functional properties of neural networks that are trained to fit to data. The key mathematical tools which support this framework include transform-domain sparse regularization, the Radon transform of computed tomography, and approximation theory, which are all techniques deeply rooted in signal processing. This framework explains the effect of weight decay regularization in neural network training, the use of skip connections and low-rank weight matrices in network architectures, the role of sparsity in neural networks, and explains why neural networks can perform well in high-dimensional problems.
    ECGAN: Self-supervised generative adversarial network for electrocardiography. (arXiv:2301.09496v1 [cs.LG])
    High-quality synthetic data can support the development of effective predictive models for biomedical tasks, especially in rare diseases or when subject to compelling privacy constraints. These limitations, for instance, negatively impact open access to electrocardiography datasets about arrhythmias. This work introduces a self-supervised approach to the generation of synthetic electrocardiography time series which is shown to promote morphological plausibility. Our model (ECGAN) allows conditioning the generative process for specific rhythm abnormalities, enhancing synchronization and diversity across samples with respect to literature models. A dedicated sample quality assessment framework is also defined, leveraging arrhythmia classifiers. The empirical results highlight a substantial improvement against state-of-the-art generative models for sequences and audio synthesis.
    Sampling-based Nystr\"om Approximation and Kernel Quadrature. (arXiv:2301.09517v1 [math.NA])
    We analyze the Nystr\"om approximation of a positive definite kernel associated with a probability measure. We first prove an improved error bound for the conventional Nystr\"om approximation with i.i.d. sampling and singular-value decomposition in the continuous regime; the proof techniques are borrowed from statistical learning theory. We further introduce a refined selection of subspaces in Nystr\"om approximation with theoretical guarantees that is applicable to non-i.i.d. landmark points. Finally, we discuss their application to convex kernel quadrature and give novel theoretical guarantees as well as numerical observations.
    DIFFormer: Scalable (Graph) Transformers Induced by Energy Constrained Diffusion. (arXiv:2301.09474v1 [cs.LG])
    Real-world data generation often involves complex inter-dependencies among instances, violating the IID-data hypothesis of standard learning paradigms and posing a challenge for uncovering the geometric structures for learning desired instance representations. To this end, we introduce an energy constrained diffusion model which encodes a batch of instances from a dataset into evolutionary states that progressively incorporate other instances' information by their interactions. The diffusion process is constrained by descent criteria w.r.t.~a principled energy function that characterizes the global consistency of instance representations over latent structures. We provide rigorous theory that implies closed-form optimal estimates for the pairwise diffusion strength among arbitrary instance pairs, which gives rise to a new class of neural encoders, dubbed as DIFFormer (diffusion-based Transformers), with two instantiations: a simple version with linear complexity for prohibitive instance numbers, and an advanced version for learning complex structures. Experiments highlight the wide applicability of our model as a general-purpose encoder backbone with superior performance in various tasks, such as node classification on large graphs, semi-supervised image/text classification, and spatial-temporal dynamics prediction.
    Rethinking the Expressive Power of GNNs via Graph Biconnectivity. (arXiv:2301.09505v1 [cs.LG])
    Designing expressive Graph Neural Networks (GNNs) is a central topic in learning graph-structured data. While numerous approaches have been proposed to improve GNNs in terms of the Weisfeiler-Lehman (WL) test, generally there is still a lack of deep understanding of what additional power they can systematically and provably gain. In this paper, we take a fundamentally different perspective to study the expressive power of GNNs beyond the WL test. Specifically, we introduce a novel class of expressivity metrics via graph biconnectivity and highlight their importance in both theory and practice. As biconnectivity can be easily calculated using simple algorithms that have linear computational costs, it is natural to expect that popular GNNs can learn it easily as well. However, after a thorough review of prior GNN architectures, we surprisingly find that most of them are not expressive for any of these metrics. The only exception is the ESAN framework (Bevilacqua et al., 2022), for which we give a theoretical justification of its power. We proceed to introduce a principled and more efficient approach, called the Generalized Distance Weisfeiler-Lehman (GD-WL), which is provably expressive for all biconnectivity metrics. Practically, we show GD-WL can be implemented by a Transformer-like architecture that preserves expressiveness and enjoys full parallelizability. A set of experiments on both synthetic and real datasets demonstrates that our approach can consistently outperform prior GNN architectures.
    Digital Twins for Marine Operations: A Brief Review on Their Implementation. (arXiv:2301.09574v1 [eess.SY])
    While the concept of a digital twin to support maritime operations is gaining attention for predictive maintenance, real-time monitoring, control, and overall process optimization, clarity on its implementation is missing in the literature. Therefore, in this review we show how different authors implemented their digital twins, discuss our findings, and finally give insights on future research directions.
    Characterizing Polarization in Social Networks using the Signed Relational Latent Distance Model. (arXiv:2301.09507v1 [stat.ML])
    Graph representation learning has become a prominent tool for the characterization and understanding of the structure of networks in general and social networks in particular. Typically, these representation learning approaches embed the networks into a low-dimensional space in which the role of each individual can be characterized in terms of their latent position. A major current concern in social networks is the emergence of polarization and filter bubbles promoting a mindset of "us-versus-them" that may be defined by extreme positions believed to ultimately lead to political violence and the erosion of democracy. Such polarized networks are typically characterized in terms of signed links reflecting likes and dislikes. We propose the latent Signed relational Latent dIstance Model (SLIM) utilizing for the first time the Skellam distribution as a likelihood function for signed networks and extend the modeling to the characterization of distinct extreme positions by constraining the embedding space to polytopes. On four real social signed networks of polarization, we demonstrate that the model extracts low-dimensional characterizations that well predict friendships and animosity while providing interpretable visualizations defined by extreme positions when endowing the model with an embedding space restricted to polytopes.
    A New Approach to Learning Linear Dynamical Systems. (arXiv:2301.09519v1 [math.OC])
    Linear dynamical systems are the foundational statistical model upon which control theory is built. Both the celebrated Kalman filter and the linear quadratic regulator require knowledge of the system dynamics to provide analytic guarantees. Naturally, learning the dynamics of a linear dynamical system from linear measurements has been intensively studied since Rudolph Kalman's pioneering work in the 1960's. Towards these ends, we provide the first polynomial time algorithm for learning a linear dynamical system from a polynomial length trajectory up to polynomial error in the system parameters under essentially minimal assumptions: observability, controllability, and marginal stability. Our algorithm is built on a method of moments estimator to directly estimate Markov parameters from which the dynamics can be extracted. Furthermore, we provide statistical lower bounds when our observability and controllability assumptions are violated.
    M22: A Communication-Efficient Algorithm for Federated Learning Inspired by Rate-Distortion. (arXiv:2301.09269v1 [cs.LG])
    In federated learning (FL), the communication constraint between the remote learners and the Parameter Server (PS) is a crucial bottleneck. For this reason, model updates must be compressed so as to minimize the loss in accuracy resulting from the communication constraint. This paper proposes ``\emph{${\bf M}$-magnitude weighted $L_{\bf 2}$ distortion + $\bf 2$ degrees of freedom''} (M22) algorithm, a rate-distortion inspired approach to gradient compression for federated training of deep neural networks (DNNs). In particular, we propose a family of distortion measures between the original gradient and the reconstruction we referred to as ``$M$-magnitude weighted $L_2$'' distortion, and we assume that gradient updates follow an i.i.d. distribution -- generalized normal or Weibull, which have two degrees of freedom. In both the distortion measure and the gradient, there is one free parameter for each that can be fitted as a function of the iteration number. Given a choice of gradient distribution and distortion measure, we design the quantizer minimizing the expected distortion in gradient reconstruction. To measure the gradient compression performance under a communication constraint, we define the \emph{per-bit accuracy} as the optimal improvement in accuracy that one bit of communication brings to the centralized model over the training period. Using this performance measure, we systematically benchmark the choice of gradient distribution and distortion measure. We provide substantial insights on the role of these choices and argue that significant performance improvements can be attained using such a rate-distortion inspired compressor.
    Explaining the effects of non-convergent sampling in the training of Energy-Based Models. (arXiv:2301.09428v1 [cs.LG])
    In this paper, we quantify the impact of using non-convergent Markov chains to train Energy-Based models (EBMs). In particular, we show analytically that EBMs trained with non-persistent short runs to estimate the gradient can perfectly reproduce a set of empirical statistics of the data, not at the level of the equilibrium measure, but through a precise dynamical process. Our results provide a first-principles explanation for the observations of recent works proposing the strategy of using short runs starting from random initial conditions as an efficient way to generate high-quality samples in EBMs, and lay the groundwork for using EBMs as diffusion models. After explaining this effect in generic EBMs, we analyze two solvable models in which the effect of the non-convergent sampling in the trained parameters can be described in detail. Finally, we test these predictions numerically on the Boltzmann machine.
    StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis. (arXiv:2301.09515v1 [cs.LG])
    Text-to-image synthesis has recently seen significant progress thanks to large pretrained language models, large-scale training data, and the introduction of scalable model families such as diffusion and autoregressive models. However, the best-performing models require iterative evaluation to generate a single sample. In contrast, generative adversarial networks (GANs) only need a single forward pass. They are thus much faster, but they currently remain far behind the state-of-the-art in large-scale text-to-image synthesis. This paper aims to identify the necessary steps to regain competitiveness. Our proposed model, StyleGAN-T, addresses the specific requirements of large-scale text-to-image synthesis, such as large capacity, stable training on diverse datasets, strong text alignment, and controllable variation vs. text alignment tradeoff. StyleGAN-T significantly improves over previous GANs and outperforms distilled diffusion models - the previous state-of-the-art in fast text-to-image synthesis - in terms of sample quality and speed.
    SpArX: Sparse Argumentative Explanations for Neural Networks. (arXiv:2301.09559v1 [cs.AI])
    Neural networks (NNs) have various applications in AI, but explaining their decision process remains challenging. Existing approaches often focus on explaining how changing individual inputs affects NNs' outputs. However, an explanation that is consistent with the input-output behaviour of an NN is not necessarily faithful to the actual mechanics thereof. In this paper, we exploit relationships between multi-layer perceptrons (MLPs) and quantitative argumentation frameworks (QAFs) to create argumentative explanations for the mechanics of MLPs. Our SpArX method first sparsifies the MLP while maintaining as much of the original mechanics as possible. It then translates the sparse MLP into an equivalent QAF to shed light on the underlying decision process of the MLP, producing global and/or local explanations. We demonstrate experimentally that SpArX can give more faithful explanations than existing approaches, while simultaneously providing deeper insights into the actual reasoning process of MLPs.
    Modality-Agnostic Variational Compression of Implicit Neural Representations. (arXiv:2301.09479v1 [stat.ML])
    We introduce a modality-agnostic neural data compression algorithm based on a functional view of data and parameterised as an Implicit Neural Representation (INR). Bridging the gap between latent coding and sparsity, we obtain compact latent representations which are non-linearly mapped to a soft gating mechanism capable of specialising a shared INR base network to each data item through subnetwork selection. After obtaining a dataset of such compact latent representations, we directly optimise the rate/distortion trade-off in this modality-agnostic space using non-linear transform coding. We term this method Variational Compression of Implicit Neural Representation (VC-INR) and show both improved performance given the same representational capacity pre quantisation while also outperforming previous quantisation schemes used for other INR-based techniques. Our experiments demonstrate strong results over a large set of diverse data modalities using the same algorithm without any modality-specific inductive biases. We show results on images, climate data, 3D shapes and scenes as well as audio and video, introducing VC-INR as the first INR-based method to outperform codecs as well-known and diverse as JPEG 2000, MP3 and AVC/HEVC on their respective modalities.
    DASTSiam: Spatio-Temporal Fusion and Discriminative Augmentation for Improved Siamese Tracking. (arXiv:2301.09063v1 [cs.CV])
    Tracking tasks based on deep neural networks have greatly improved with the emergence of Siamese trackers. However, the appearance of targets often changes during tracking, which can reduce the robustness of the tracker when facing challenges such as aspect ratio change, occlusion, and scale variation. In addition, cluttered backgrounds can lead to multiple high response points in the response map, leading to incorrect target positioning. In this paper, we introduce two transformer-based modules to improve Siamese tracking called DASTSiam: the spatio-temporal (ST) fusion module and the Discriminative Augmentation (DA) module. The ST module uses cross-attention based accumulation of historical cues to improve robustness against object appearance changes, while the DA module associates semantic information between the template and search region to improve target discrimination. Moreover, Modifying the label assignment of anchors also improves the reliability of the object location. Our modules can be used with all Siamese trackers and show improved performance on several public datasets through comparative and ablation experiments.
    Modeling Non-deterministic Human Behaviors in Discrete Food Choices. (arXiv:2301.09454v1 [stat.ML])
    We establish a non-deterministic model that predicts a user's food preferences from their demographic information. Our simulator is based on NHANES dataset and domain expert knowledge in the form of established behavioral studies. Our model can be used to generate an arbitrary amount of synthetic datapoints that are similar in distribution to the original dataset and align with behavioral science expectations. Such a simulator can be used in a variety of machine learning tasks and especially in applications requiring human behavior prediction.
    A Framework for Evaluating the Impact of Food Security Scenarios. (arXiv:2301.09320v1 [cs.LG])
    This study proposes an approach for predicting the impacts of scenarios on food security and demonstrates its application in a case study. The approach involves two main steps: (1) scenario definition, in which the end user specifies the assumptions and impacts of the scenario using a scenario template, and (2) scenario evaluation, in which a Vector Autoregression (VAR) model is used in combination with Monte Carlo simulation to generate predictions for the impacts of the scenario based on the defined assumptions and impacts. The case study is based on a proprietary time series food security database created using data from the Food and Agriculture Organization of the United Nations (FAOSTAT), the World Bank, and the United States Department of Agriculture (USDA). The database contains a wide range of data on various indicators of food security, such as production, trade, consumption, prices, availability, access, and nutritional value. The results show that the proposed approach can be used to predict the potential impacts of scenarios on food security and that the proprietary time series food security database can be used to support this approach. The study provides specific insights on how this approach can inform decision-making processes related to food security such as food prices and availability in the case study region.
    An iterative multi-fidelity approach for model order reduction of multi-dimensional input parametric PDE systems. (arXiv:2301.09483v1 [math.NA])
    We propose a parametric sampling strategy for the reduction of large-scale PDE systems with multidimensional input parametric spaces by leveraging models of different fidelity. The design of this methodology allows a user to adaptively sample points ad hoc from a discrete training set with no prior requirement of error estimators. It is achieved by exploiting low-fidelity models throughout the parametric space to sample points using an efficient sampling strategy, and at the sampled parametric points, high-fidelity models are evaluated to recover the reduced basis functions. The low-fidelity models are then adapted with the reduced order models ( ROMs) built by projection onto the subspace spanned by the recovered basis functions. The process continues until the low-fidelity model can represent the high-fidelity model adequately for all the parameters in the parametric space. Since the proposed methodology leverages the use of low-fidelity models to assimilate the solution database, it significantly reduces the computational cost in the offline stage. The highlight of this article is to present the construction of the initial low-fidelity model, and a sampling strategy based on the discrete empirical interpolation method (DEIM). We test this approach on a 2D steady-state heat conduction problem for two different input parameters and make a qualitative comparison with the classical greedy reduced basis method (RBM), and further test on a 9-dimensional parametric non-coercive elliptic problem and analyze the computational performance based on different tuning of greedy selection of points.
    Speeding Up BatchBALD: A k-BALD Family of Approximations for Active Learning. (arXiv:2301.09490v1 [cs.LG])
    Active learning is a powerful method for training machine learning models with limited labeled data. One commonly used technique for active learning is BatchBALD, which uses Bayesian neural networks to find the most informative points to label in a pool set. However, BatchBALD can be very slow to compute, especially for larger datasets. In this paper, we propose a new approximation, k-BALD, which uses k-wise mutual information terms to approximate BatchBALD, making it much less expensive to compute. Results on the MNIST dataset show that k-BALD is significantly faster than BatchBALD while maintaining similar performance. Additionally, we also propose a dynamic approach for choosing k based on the quality of the approximation, making it more efficient for larger datasets.
    A Simple Recipe for Competitive Low-compute Self supervised Vision Models. (arXiv:2301.09451v1 [cs.CV])
    Self-supervised methods in vision have been mostly focused on large architectures as they seem to suffer from a significant performance drop for smaller architectures. In this paper, we propose a simple self-supervised distillation technique that can train high performance low-compute neural networks. Our main insight is that existing joint-embedding based SSL methods can be repurposed for knowledge distillation from a large self-supervised teacher to a small student model. Thus, we call our method Replace one Branch (RoB) as it simply replaces one branch of the joint-embedding training with a large teacher model. RoB is widely applicable to a number of architectures such as small ResNets, MobileNets and ViT, and pretrained models such as DINO, SwAV or iBOT. When pretraining on the ImageNet dataset, RoB yields models that compete with supervised knowledge distillation. When applied to MSN, RoB produces students with strong semi-supervised capabilities. Finally, our best ViT-Tiny models improve over prior SSL state-of-the-art on ImageNet by $2.3\%$ and are on par or better than a supervised distilled DeiT on five downstream transfer tasks (iNaturalist, CIFAR, Clevr/Count, Clevr/Dist and Places). We hope RoB enables practical self-supervision at smaller scale.
    New Insights into Multi-Calibration. (arXiv:2301.08837v1 [cs.LG])
    We identify a novel connection between the recent literature on multi-group fairness for prediction algorithms and well-established notions of graph regularity from extremal graph theory. We frame our investigation using new, statistical distance-based variants of multi-calibration that are closely related to the concept of outcome indistinguishability. Adopting this perspective leads us naturally not only to our graph theoretic results, but also to new multi-calibration algorithms with improved complexity in certain parameter regimes, and to a generalization of a state-of-the-art result on omniprediction. Along the way, we also unify several prior algorithms for achieving multi-group fairness, as well as their analyses, through the lens of no-regret learning.
    LSTM and CNN application for core-collapse supernova search in gravitational wave real data. (arXiv:2301.09387v1 [astro-ph.IM])
    $Context.$ Core-collapse supernovae (CCSNe) are expected to emit gravitational wave signals that could be detected by current and future generation interferometers within the Milky Way and nearby galaxies. The stochastic nature of the signal arising from CCSNe requires alternative detection methods to matched filtering. $Aims.$ We aim to show the potential of machine learning (ML) for multi-label classification of different CCSNe simulated signals and noise transients using real data. We compared the performance of 1D and 2D convolutional neural networks (CNNs) on single and multiple detector data. For the first time, we tested multi-label classification also with long short-term memory (LSTM) networks. $Methods.$ We applied a search and classification procedure for CCSNe signals, using an event trigger generator, the Wavelet Detection Filter (WDF), coupled with ML. We used time series and time-frequency representations of the data as inputs to the ML models. To compute classification accuracies, we simultaneously injected, at detectable distance of 1\,kpc, CCSN waveforms, obtained from recent hydrodynamical simulations of neutrino-driven core-collapse, onto interferometer noise from the O2 LIGO and Virgo science run. $Results.$ We compared the performance of the three models on single detector data. We then merged the output of the models for single detector classification of noise and astrophysical transients, obtaining overall accuracies for LIGO ($\sim99\%$) and ($\sim80\%$) for Virgo. We extended our analysis to the multi-detector case using triggers coincident among the three ITFs and achieved an accuracy of $\sim98\%$.
    Ordinal Regression for Difficulty Estimation of StepMania Levels. (arXiv:2301.09485v1 [cs.LG])
    StepMania is a popular open-source clone of a rhythm-based video game. As is common in popular games, there is a large number of community-designed levels. It is often difficult for players and level authors to determine the difficulty level of such community contributions. In this work, we formalize and analyze the difficulty prediction task on StepMania levels as an ordinal regression (OR) task. We standardize a more extensive and diverse selection of this data resulting in five data sets, two of which are extensions of previous work. We evaluate many competitive OR and non-OR models, demonstrating that neural network-based models significantly outperform the state of the art and that StepMania-level data makes for an excellent test bed for deep OR models. We conclude with a user experiment showing our trained models' superiority over human labeling.
    Hierarchically branched diffusion models for efficient and interpretable multi-class conditional generation. (arXiv:2212.10777v2 [cs.LG] UPDATED)
    Diffusion models have achieved justifiable popularity by attaining state-of-the-art performance in generating realistic objects, including when conditioning generation on labels. Current diffusion models are universally linear in nature, modeling diffusion identically for objects of all classes. For the multi-class conditional generation problem, we propose a novel, structurally unique framework of diffusion models which are hierarchically branched according to the inherent relationships between classes. In this work, we showcase several advantages of branched diffusion models. We demonstrate that branched models generate samples more efficiently, and are more easily extended to novel classes in a continual-learning setting. We also show that branched models enjoy a unique interpretability that offers insight into the modeled data distribution. Branched diffusion models represent an alternative paradigm to their traditional linear counterparts, and can have large impacts in how we use diffusion models for efficient generation, online learning, and scientific discovery.
    LF-checker: Machine Learning Acceleration of Bounded Model Checking for Concurrency Verification (Competition Contribution). (arXiv:2301.09142v1 [cs.LG])
    We describe and evaluate LF-checker, a metaverifier tool based on machine learning. It extracts multiple features of the program under test and predicts the optimal configuration (flags) of a bounded model checker with a decision tree. Our current work is specialised in concurrency verification and employs ESBMC as a back-end verification engine. In the paper, we demonstrate that LF-checker achieves better results than the default configuration of the underlying verification engine.
    GP-NAS-ensemble: a model for NAS Performance Prediction. (arXiv:2301.09231v1 [cs.LG])
    It is of great significance to estimate the performance of a given model architecture without training in the application of Neural Architecture Search (NAS) as it may take a lot of time to evaluate the performance of an architecture. In this paper, a novel NAS framework called GP-NAS-ensemble is proposed to predict the performance of a neural network architecture with a small training dataset. We make several improvements on the GP-NAS model to make it share the advantage of ensemble learning methods. Our method ranks second in the CVPR2022 second lightweight NAS challenge performance prediction track.
    SMDDH: Singleton Mention detection using Deep Learning in Hindi Text. (arXiv:2301.09361v1 [cs.CL])
    Mention detection is an important component of coreference resolution system, where mentions such as name, nominal, and pronominals are identified. These mentions can be purely coreferential mentions or singleton mentions (non-coreferential mentions). Coreferential mentions are those mentions in a text that refer to the same entities in a real world. Whereas, singleton mentions are mentioned only once in the text and do not participate in the coreference as they are not mentioned again in the following text. Filtering of these singleton mentions can substantially improve the performance of a coreference resolution process. This paper proposes a singleton mention detection module based on a fully connected network and a Convolutional neural network for Hindi text. This model utilizes a few hand-crafted features and context information, and word embedding for words. The coreference annotated Hindi dataset comprising of 3.6K sentences, and 78K tokens are used for the task. In terms of Precision, Recall, and F-measure, the experimental findings obtained are excellent.
    A Comprehensive Survey on Heart Sound Analysis in the Deep Learning Era. (arXiv:2301.09362v1 [cs.SD])
    Heart sound auscultation has been demonstrated to be beneficial in clinical usage for early screening of cardiovascular diseases. Due to the high requirement of well-trained professionals for auscultation, automatic auscultation benefiting from signal processing and machine learning can help auxiliary diagnosis and reduce the burdens of training professional clinicians. Nevertheless, classic machine learning is limited to performance improvement in the era of big data. Deep learning has achieved better performance than classic machine learning in many research fields, as it employs more complex model architectures with stronger capability of extracting effective representations. Deep learning has been successfully applied to heart sound analysis in the past years. As most review works about heart sound analysis were given before 2017, the present survey is the first to work on a comprehensive overview to summarise papers on heart sound analysis with deep learning in the past six years 2017--2022. We introduce both classic machine learning and deep learning for comparison, and further offer insights about the advances and future research directions in deep learning for heart sound analysis.
    MATT: Multimodal Attention Level Estimation for e-learning Platforms. (arXiv:2301.09174v1 [cs.CV])
    This work presents a new multimodal system for remote attention level estimation based on multimodal face analysis. Our multimodal approach uses different parameters and signals obtained from the behavior and physiological processes that have been related to modeling cognitive load such as faces gestures (e.g., blink rate, facial actions units) and user actions (e.g., head pose, distance to the camera). The multimodal system uses the following modules based on Convolutional Neural Networks (CNNs): Eye blink detection, head pose estimation, facial landmark detection, and facial expression features. First, we individually evaluate the proposed modules in the task of estimating the student's attention level captured during online e-learning sessions. For that we trained binary classifiers (high or low attention) based on Support Vector Machines (SVM) for each module. Secondly, we find out to what extent multimodal score level fusion improves the attention level estimation. The mEBAL database is used in the experimental framework, a public multi-modal database for attention level estimation obtained in an e-learning environment that contains data from 38 users while conducting several e-learning tasks of variable difficulty (creating changes in student cognitive loads).
    A Tale of Two Latent Flows: Learning Latent Space Normalizing Flow with Short-run Langevin Flow for Approximate Inference. (arXiv:2301.09300v1 [stat.ML])
    We study a normalizing flow in the latent space of a top-down generator model, in which the normalizing flow model plays the role of the informative prior model of the generator. We propose to jointly learn the latent space normalizing flow prior model and the top-down generator model by a Markov chain Monte Carlo (MCMC)-based maximum likelihood algorithm, where a short-run Langevin sampling from the intractable posterior distribution is performed to infer the latent variables for each observed example, so that the parameters of the normalizing flow prior and the generator can be updated with the inferred latent variables. We show that, under the scenario of non-convergent short-run MCMC, the finite step Langevin dynamics is a flow-like approximate inference model and the learning objective actually follows the perturbation of the maximum likelihood estimation (MLE). We further point out that the learning framework seeks to (i) match the latent space normalizing flow and the aggregated posterior produced by the short-run Langevin flow, and (ii) bias the model from MLE such that the short-run Langevin flow inference is close to the true posterior. Empirical results of extensive experiments validate the effectiveness of the proposed latent space normalizing flow model in the tasks of image generation, image reconstruction, anomaly detection, supervised image inpainting and unsupervised image recovery.
    Enabling Hard Constraints in Differentiable Neural Network and Accelerator Co-Exploration. (arXiv:2301.09312v1 [cs.LG])
    Co-exploration of an optimal neural architecture and its hardware accelerator is an approach of rising interest which addresses the computational cost problem, especially in low-profile systems. The large co-exploration space is often handled by adopting the idea of differentiable neural architecture search. However, despite the superior search efficiency of the differentiable co-exploration, it faces a critical challenge of not being able to systematically satisfy hard constraints such as frame rate. To handle the hard constraint problem of differentiable co-exploration, we propose HDX, which searches for hard-constrained solutions without compromising the global design objectives. By manipulating the gradients in the interest of the given hard constraint, high-quality solutions satisfying the constraint can be obtained.
    On Multi-Agent Deep Deterministic Policy Gradients and their Explainability for SMARTS Environment. (arXiv:2301.09420v1 [cs.LG])
    Multi-Agent RL or MARL is one of the complex problems in Autonomous Driving literature that hampers the release of fully-autonomous vehicles today. Several simulators have been in iteration after their inception to mitigate the problem of complex scenarios with multiple agents in Autonomous Driving. One such simulator--SMARTS, discusses the importance of cooperative multi-agent learning. For this problem, we discuss two approaches--MAPPO and MADDPG, which are based on-policy and off-policy RL approaches. We compare our results with the state-of-the-art results for this challenge and discuss the potential areas of improvement while discussing the explainability of these approaches in conjunction with waypoints in the SMARTS environment.
    Explainable Quantum Machine Learning. (arXiv:2301.09138v1 [quant-ph])
    Methods of artificial intelligence (AI) and especially machine learning (ML) have been growing ever more complex, and at the same time have more and more impact on people's lives. This leads to explainable AI (XAI) manifesting itself as an important research field that helps humans to better comprehend ML systems. In parallel, quantum machine learning (QML) is emerging with the ongoing improvement of quantum computing hardware combined with its increasing availability via cloud services. QML enables quantum-enhanced ML in which quantum mechanics is exploited to facilitate ML tasks, typically in form of quantum-classical hybrid algorithms that combine quantum and classical resources. Quantum gates constitute the building blocks of gate-based quantum hardware and form circuits that can be used for quantum computations. For QML applications, quantum circuits are typically parameterized and their parameters are optimized classically such that a suitably defined objective function is minimized. Inspired by XAI, we raise the question of explainability of such circuits by quantifying the importance of (groups of) gates for specific goals. To this end, we transfer and adapt the well-established concept of Shapley values to the quantum realm. The resulting attributions can be interpreted as explanations for why a specific circuit works well for a given task, improving the understanding of how to construct parameterized (or variational) quantum circuits, and fostering their human interpretability in general. An experimental evaluation on simulators and two superconducting quantum hardware devices demonstrates the benefits of the proposed framework for classification, generative modeling, transpilation, and optimization. Furthermore, our results shed some light on the role of specific gates in popular QML approaches.
    Prompt Federated Learning for Weather Forecasting: Toward Foundation Models on Meteorological Data. (arXiv:2301.09152v1 [cs.LG])
    To tackle the global climate challenge, it urgently needs to develop a collaborative platform for comprehensive weather forecasting on large-scale meteorological data. Despite urgency, heterogeneous meteorological sensors across countries and regions, inevitably causing multivariate heterogeneity and data exposure, become the main barrier. This paper develops a foundation model across regions capable of understanding complex meteorological data and providing weather forecasting. To relieve the data exposure concern across regions, a novel federated learning approach has been proposed to collaboratively learn a brand-new spatio-temporal Transformer-based foundation model across participants with heterogeneous meteorological data. Moreover, a novel prompt learning mechanism has been adopted to satisfy low-resourced sensors' communication and computational constraints. The effectiveness of the proposed method has been demonstrated on classical weather forecasting tasks using three meteorological datasets with multivariate time series.
    Lower Bounds on Learning Pauli Channels. (arXiv:2301.09192v1 [quant-ph])
    Understanding the noise affecting a quantum device is of fundamental importance for scaling quantum technologies. A particularly important class of noise models is that of Pauli channels, as randomized compiling techniques can effectively bring any quantum channel to this form and are significantly more structured than general quantum channels. In this paper, we show fundamental lower bounds on the sample complexity for learning Pauli channels in diamond norm with unentangled measurements. We consider both adaptive and non-adaptive strategies. In the non-adaptive setting, we show a lower bound of $\Omega(2^{3n}\epsilon^{-2})$ to learn an $n$-qubit Pauli channel. In particular, this shows that the recently introduced learning procedure by Flammia and Wallman is essentially optimal. In the adaptive setting, we show a lower bound of $\Omega(2^{2.5n}\epsilon^{-2})$ for $\epsilon=\mathcal{O}(2^{-n})$, and a lower bound of $\Omega(2^{2n}\epsilon^{-2} )$ for any $\epsilon > 0$. This last lower bound even applies for arbitrarily many sequential uses of the channel, as long as they are only interspersed with other unital operations.
    On the Expressive Power of Geometric Graph Neural Networks. (arXiv:2301.09308v1 [cs.LG])
    The expressive power of Graph Neural Networks (GNNs) has been studied extensively through the Weisfeiler-Leman (WL) graph isomorphism test. However, standard GNNs and the WL framework are inapplicable for geometric graphs embedded in Euclidean space, such as biomolecules, materials, and other physical systems. In this work, we propose a geometric version of the WL test (GWL) for discriminating geometric graphs while respecting the underlying physical symmetries: permutations, rotation, reflection, and translation. We use GWL to characterise the expressive power of geometric GNNs that are invariant or equivariant to physical symmetries in terms of distinguishing geometric graphs. GWL unpacks how key design choices influence geometric GNN expressivity: (1) Invariant layers have limited expressivity as they cannot distinguish one-hop identical geometric graphs; (2) Equivariant layers distinguish a larger class of graphs by propagating geometric information beyond local neighbourhoods; (3) Higher order tensors and scalarisation enable maximally powerful geometric GNNs; and (4) GWL's discrimination-based perspective is equivalent to universal approximation. Synthetic experiments supplementing our results are available at https://github.com/chaitjo/geometric-gnn-dojo
    Abstracting Imperfect Information Away from Two-Player Zero-Sum Games. (arXiv:2301.09159v1 [cs.GT])
    In their seminal work, Nayyar et al. (2013) showed that imperfect information can be abstracted away from common-payoff games by having players publicly announce their policies as they play. This insight underpins sound solvers and decision-time planning algorithms for common-payoff games. Unfortunately, a naive application of the same insight to two-player zero-sum games fails because Nash equilibria of the game with public policy announcements may not correspond to Nash equilibria of the original game. As a consequence, existing sound decision-time planning algorithms require complicated additional mechanisms that have unappealing properties. The main contribution of this work is showing that certain regularized equilibria do not possess the aforementioned non-correspondence problem -- thus, computing them can be treated as perfect information problems. Because these regularized equilibria can be made arbitrarily close to Nash equilibria, our result opens the door to a new perspective on solving two-player zero-sum games and, in particular, yields a simplified framework for decision-time planning in two-player zero-sum games, void of the unappealing properties that plague existing decision-time planning approaches.
    Optimising complexity of CNN models for resource constrained devices: QRS detection case study. (arXiv:2301.09232v1 [cs.LG])
    Traditional DL models are complex and resource hungry and thus, care needs to be taken in designing Internet of (medical) things (IoT, or IoMT) applications balancing efficiency-complexity trade-off. Recent IoT solutions tend to avoid using deep-learning methods due to such complexities, and rather classical filter-based methods are commonly used. We hypothesize that a shallow CNN model can offer satisfactory level of performance in combination by leveraging other essential solution-components, such as post-processing that is suitable for resource constrained environment. In an IoMT application context, QRS-detection and R-peak localisation from ECG signal as a case study, the complexities of CNN models and post-processing were varied to identify a set of combinations suitable for a range of target resource-limited environments. To the best of our knowledge, finding a deploy-able configuration, by incrementally increasing the CNN model complexity, as required to match the target's resource capacity, and leveraging the strength of post-processing, is the first of its kind. The results show that a shallow 2-layer CNN with a suitable post-processing can achieve $>$90\% F1-score, and the scores continue to improving for 8-32 layer CNNs, which can be used to profile target constraint environment. The outcome shows that it is possible to design an optimal DL solution with known target performance characteristics and resource (computing capacity, and memory) constraints.
    A Survey on Actionable Knowledge. (arXiv:2301.09317v1 [cs.LG])
    Actionable Knowledge Discovery (AKD) is a crucial aspect of data mining that is gaining popularity and being applied in a wide range of domains. This is because AKD can extract valuable insights and information, also known as knowledge, from large datasets. The goal of this paper is to examine different research studies that focus on various domains and have different objectives. The paper will review and discuss the methods used in these studies in detail. AKD is a process of identifying and extracting actionable insights from data, which can be used to make informed decisions and improve business outcomes. It is a powerful tool for uncovering patterns and trends in data that can be used for various applications such as customer relationship management, marketing, and fraud detection. The research studies reviewed in this paper will explore different techniques and approaches for AKD in different domains, such as healthcare, finance, and telecommunications. The paper will provide a thorough analysis of the current state of AKD in the field and will review the main methods used by various research studies. Additionally, the paper will evaluate the advantages and disadvantages of each method and will discuss any novel or new solutions presented in the field. Overall, this paper aims to provide a comprehensive overview of the methods and techniques used in AKD and the impact they have on different domains.
    Max-Quantile Grouped Infinite-Arm Bandits. (arXiv:2210.01295v2 [stat.ML] UPDATED)
    In this paper, we consider a bandit problem in which there are a number of groups each consisting of infinitely many arms. Whenever a new arm is requested from a given group, its mean reward is drawn from an unknown reservoir distribution (different for each group), and the uncertainty in the arm's mean reward can only be reduced via subsequent pulls of the arm. The goal is to identify the infinite-arm group whose reservoir distribution has the highest $(1-\alpha)$-quantile (e.g., median if $\alpha = \frac{1}{2}$), using as few total arm pulls as possible. We introduce a two-step algorithm that first requests a fixed number of arms from each group and then runs a finite-arm grouped max-quantile bandit algorithm. We characterize both the instance-dependent and worst-case regret, and provide a matching lower bound for the latter, while discussing various strengths, weaknesses, algorithmic improvements, and potential lower bounds associated with our instance-dependent upper bounds.
    Differentially Private Natural Language Models: Recent Advances and Future Directions. (arXiv:2301.09112v1 [cs.CL])
    Recent developments in deep learning have led to great success in various natural language processing (NLP) tasks. However, these applications may involve data that contain sensitive information. Therefore, how to achieve good performance while also protect privacy of sensitive data is a crucial challenge in NLP. To preserve privacy, Differential Privacy (DP), which can prevent reconstruction attacks and protect against potential side knowledge, is becoming a de facto technique for private data analysis. In recent years, NLP in DP models (DP-NLP) has been studied from different perspectives, which deserves a comprehensive review. In this paper, we provide the first systematic review of recent advances on DP deep learning models in NLP. In particular, we first discuss some differences and additional challenges of DP-NLP compared with the standard DP deep learning. Then we investigate some existing work on DP-NLP and present its recent developments from two aspects: gradient perturbation based methods and embedding vector perturbation based methods. We also discuss some challenges and future directions of this topic.
    Learning to Reject with a Fixed Predictor: Application to Decontextualization. (arXiv:2301.09044v1 [cs.LG])
    We study the problem of classification with a reject option for a fixed predictor, applicable in natural language processing. \ignore{where many correct labels are often possible} We introduce a new problem formulation for this scenario, and an algorithm minimizing a new surrogate loss function. We provide a complete theoretical analysis of the surrogate loss function with a strong $H$-consistency guarantee. For evaluation, we choose the \textit{decontextualization} task, and provide a manually-labelled dataset of $2\mathord,000$ examples. Our algorithm significantly outperforms the baselines considered, with a $\sim\!\!25\%$ improvement in coverage when halving the error rate, which is only $\sim\!\! 3 \%$ away from the theoretical limit.
    Provable Unrestricted Adversarial Training without Compromise with Generalizability. (arXiv:2301.09069v1 [cs.LG])
    Adversarial training (AT) is widely considered as the most promising strategy to defend against adversarial attacks and has drawn increasing interest from researchers. However, the existing AT methods still suffer from two challenges. First, they are unable to handle unrestricted adversarial examples (UAEs), which are built from scratch, as opposed to restricted adversarial examples (RAEs), which are created by adding perturbations bound by an $l_p$ norm to observed examples. Second, the existing AT methods often achieve adversarial robustness at the expense of standard generalizability (i.e., the accuracy on natural examples) because they make a tradeoff between them. To overcome these challenges, we propose a unique viewpoint that understands UAEs as imperceptibly perturbed unobserved examples. Also, we find that the tradeoff results from the separation of the distributions of adversarial examples and natural examples. Based on these ideas, we propose a novel AT approach called Provable Unrestricted Adversarial Training (PUAT), which can provide a target classifier with comprehensive adversarial robustness against both UAE and RAE, and simultaneously improve its standard generalizability. Particularly, PUAT utilizes partially labeled data to achieve effective UAE generation by accurately capturing the natural data distribution through a novel augmented triple-GAN. At the same time, PUAT extends the traditional AT by introducing the supervised loss of the target classifier into the adversarial loss and achieves the alignment between the UAE distribution, the natural data distribution, and the distribution learned by the classifier, with the collaboration of the augmented triple-GAN. Finally, the solid theoretical analysis and extensive experiments conducted on widely-used benchmarks demonstrate the superiority of PUAT.
    Relaxed Models for Adversarial Streaming: The Advice Model and the Bounded Interruptions Model. (arXiv:2301.09203v1 [cs.DS])
    Streaming algorithms are typically analyzed in the oblivious setting, where we assume that the input stream is fixed in advance. Recently, there is a growing interest in designing adversarially robust streaming algorithms that must maintain utility even when the input stream is chosen adaptively and adversarially as the execution progresses. While several fascinating results are known for the adversarial setting, in general, it comes at a very high cost in terms of the required space. Motivated by this, in this work we set out to explore intermediate models that allow us to interpolate between the oblivious and the adversarial models. Specifically, we put forward the following two models: (1) *The advice model*, in which the streaming algorithm may occasionally ask for one bit of advice. (2) *The bounded interruptions model*, in which we assume that the adversary is only partially adaptive. We present both positive and negative results for each of these two models. In particular, we present generic reductions from each of these models to the oblivious model. This allows us to design robust algorithms with significantly improved space complexity compared to what is known in the plain adversarial model.
    Congested Bandits: Optimal Routing via Short-term Resets. (arXiv:2301.09251v1 [cs.LG])
    For traffic routing platforms, the choice of which route to recommend to a user depends on the congestion on these routes -- indeed, an individual's utility depends on the number of people using the recommended route at that instance. Motivated by this, we introduce the problem of Congested Bandits where each arm's reward is allowed to depend on the number of times it was played in the past $\Delta$ timesteps. This dependence on past history of actions leads to a dynamical system where an algorithm's present choices also affect its future pay-offs, and requires an algorithm to plan for this. We study the congestion aware formulation in the multi-armed bandit (MAB) setup and in the contextual bandit setup with linear rewards. For the multi-armed setup, we propose a UCB style algorithm and show that its policy regret scales as $\tilde{O}(\sqrt{K \Delta T})$. For the linear contextual bandit setup, our algorithm, based on an iterative least squares planner, achieves policy regret $\tilde{O}(\sqrt{dT} + \Delta)$. From an experimental standpoint, we corroborate the no-regret properties of our algorithms via a simulation study.
    Energy Prediction using Federated Learning. (arXiv:2301.09165v1 [cs.LG])
    In this work, we demonstrate the viability of using federated learning to successfully predict energy consumption as well as solar production for all households within a certain network using low-power and low-space consuming embedded devices. We also demonstrate our prediction performance improving over time without the need for sharing private consumer energy data. We simulate a system with four nodes using data for one year to show this.
    Doubly Adversarial Federated Bandits. (arXiv:2301.09223v1 [stat.ML])
    We study a new non-stochastic federated multi-armed bandit problem with multiple agents collaborating via a communication network. The losses of the arms are assigned by an oblivious adversary that specifies the loss of each arm not only for each time step but also for each agent, which we call ``doubly adversarial". In this setting, different agents may choose the same arm in the same time step but observe different feedback. The goal of each agent is to find a globally best arm in hindsight that has the lowest cumulative loss averaged over all agents, which necessities the communication among agents. We provide regret lower bounds for any federated bandit algorithm under different settings, when agents have access to full-information feedback, or the bandit feedback. For the bandit feedback setting, we propose a near-optimal federated bandit algorithm called FEDEXP3. Our algorithm gives a positive answer to an open question proposed in Cesa-Bianchi et al. (2016): FEDEXP3 can guarantee a sub-linear regret without exchanging sequences of selected arm identities or loss sequences among agents. We also provide numerical evaluations of our algorithm to validate our theoretical results and demonstrate its effectiveness on synthetic and real-world datasets
    Deterministic Online Classification: Non-iteratively Reweighted Recursive Least-Squares for Binary Class Rebalancing. (arXiv:2301.09230v1 [cs.LG])
    Deterministic solutions are becoming more critical for interpretability. Weighted Least-Squares (WLS) has been widely used as a deterministic batch solution with a specific weight design. In the online settings of WLS, exact reweighting is necessary to converge to its batch settings. In order to comply with its necessity, the iteratively reweighted least-squares algorithm is mainly utilized with a linearly growing time complexity which is not attractive for online learning. Due to the high and growing computational costs, an efficient online formulation of reweighted least-squares is desired. We introduce a new deterministic online classification algorithm of WLS with a constant time complexity for binary class rebalancing. We demonstrate that our proposed online formulation exactly converges to its batch formulation and outperforms existing state-of-the-art stochastic online binary classification algorithms in real-world data sets empirically.
    Towards NeuroAI: Introducing Neuronal Diversity into Artificial Neural Networks. (arXiv:2301.09245v1 [cs.NE])
    Throughout history, the development of artificial intelligence, particularly artificial neural networks, has been open to and constantly inspired by the increasingly deepened understanding of the brain, such as the inspiration of neocognitron, which is the pioneering work of convolutional neural networks. Per the motives of the emerging field: NeuroAI, a great amount of neuroscience knowledge can help catalyze the next generation of AI by endowing a network with more powerful capabilities. As we know, the human brain has numerous morphologically and functionally different neurons, while artificial neural networks are almost exclusively built on a single neuron type. In the human brain, neuronal diversity is an enabling factor for all kinds of biological intelligent behaviors. Since an artificial network is a miniature of the human brain, introducing neuronal diversity should be valuable in terms of addressing those essential problems of artificial networks such as efficiency, interpretability, and memory. In this Primer, we first discuss the preliminaries of biological neuronal diversity and the characteristics of information transmission and processing in a biological neuron. Then, we review studies of designing new neurons for artificial networks. Next, we discuss what gains can neuronal diversity bring into artificial networks and exemplary applications in several important fields. Lastly, we discuss the challenges and future directions of neuronal diversity to explore the potential of NeuroAI.
    MEMO : Accelerating Transformers with Memoization on Big Memory Systems. (arXiv:2301.09262v1 [cs.PF])
    Transformers gain popularity because of their superior prediction accuracy and inference throughput. However, the transformer is computation-intensive, causing a long inference time. The existing work to accelerate transformer inferences has limitations because of the changes to transformer architectures or the need for specialized hardware. In this paper, we identify the opportunities of using memoization to accelerate the attention mechanism in transformers without the above limitation. Built upon a unique observation that there is a rich similarity in attention computation across inference sequences, we build an attention database upon the emerging big memory system. We introduce the embedding technique to find semantically similar inputs to identify computation similarity. We also introduce a series of techniques such as memory mapping and selective memoization to avoid memory copy and unnecessary overhead. We enable 21% performance improvement on average (up to 68%) with the TB-scale attention database and with ignorable loss in inference accuracy.
    Practical Adversarial Attacks Against AI-Driven Power Allocation in a Distributed MIMO Network. (arXiv:2301.09305v1 [eess.SP])
    In distributed multiple-input multiple-output (D-MIMO) networks, power control is crucial to optimize the spectral efficiencies of users and max-min fairness (MMF) power control is a commonly used strategy as it satisfies uniform quality-of-service to all users. The optimal solution of MMF power control requires high complexity operations and hence deep neural network based artificial intelligence (AI) solutions are proposed to decrease the complexity. Although quite accurate models can be achieved by using AI, these models have some intrinsic vulnerabilities against adversarial attacks where carefully crafted perturbations are applied to the input of the AI model. In this work, we show that threats against the target AI model which might be originated from malicious users or radio units can substantially decrease the network performance by applying a successful adversarial sample, even in the most constrained circumstances. We also demonstrate that the risk associated with these kinds of adversarial attacks is higher than the conventional attack threats. Detailed simulations reveal the effectiveness of adversarial attacks and the necessity of smart defense techniques.
    BallGAN: 3D-aware Image Synthesis with a Spherical Background. (arXiv:2301.09091v1 [cs.CV])
    3D-aware GANs aim to synthesize realistic 3D scenes such that they can be rendered in arbitrary perspectives to produce images. Although previous methods produce realistic images, they suffer from unstable training or degenerate solutions where the 3D geometry is unnatural. We hypothesize that the 3D geometry is underdetermined due to the insufficient constraint, i.e., being classified as real image to the discriminator is not enough. To solve this problem, we propose to approximate the background as a spherical surface and represent a scene as a union of the foreground placed in the sphere and the thin spherical background. It reduces the degree of freedom in the background field. Accordingly, we modify the volume rendering equation and incorporate dedicated constraints to design a novel 3D-aware GAN framework named BallGAN. BallGAN has multiple advantages as follows. 1) It produces more reasonable 3D geometry; the images of a scene across different viewpoints have better photometric consistency and fidelity than the state-of-the-art methods. 2) The training becomes much more stable. 3) The foreground can be separately rendered on top of different arbitrary backgrounds.
    Learning to Linearize Deep Neural Networks for Secure and Efficient Private Inference. (arXiv:2301.09254v1 [cs.CV])
    The large number of ReLU non-linearity operations in existing deep neural networks makes them ill-suited for latency-efficient private inference (PI). Existing techniques to reduce ReLU operations often involve manual effort and sacrifice significant accuracy. In this paper, we first present a novel measure of non-linearity layers' ReLU sensitivity, enabling mitigation of the time-consuming manual efforts in identifying the same. Based on this sensitivity, we then present SENet, a three-stage training method that for a given ReLU budget, automatically assigns per-layer ReLU counts, decides the ReLU locations for each layer's activation map, and trains a model with significantly fewer ReLUs to potentially yield latency and communication efficient PI. Experimental evaluations with multiple models on various datasets show SENet's superior performance both in terms of reduced ReLUs and improved classification accuracy compared to existing alternatives. In particular, SENet can yield models that require up to ~2x fewer ReLUs while yielding similar accuracy. For a similar ReLU budget SENet can yield models with ~2.32% improved classification accuracy, evaluated on CIFAR-100.
    Learning in Congestion Games with Bandit Feedback. (arXiv:2206.01880v3 [cs.GT] UPDATED)
    In this paper, we investigate Nash-regret minimization in congestion games, a class of games with benign theoretical structure and broad real-world applications. We first propose a centralized algorithm based on the optimism in the face of uncertainty principle for congestion games with (semi-)bandit feedback, and obtain finite-sample guarantees. Then we propose a decentralized algorithm via a novel combination of the Frank-Wolfe method and G-optimal design. By exploiting the structure of the congestion game, we show the sample complexity of both algorithms depends only polynomially on the number of players and the number of facilities, but not the size of the action set, which can be exponentially large in terms of the number of facilities. We further define a new problem class, Markov congestion games, which allows us to model the non-stationarity in congestion games. We propose a centralized algorithm for Markov congestion games, whose sample complexity again has only polynomial dependence on all relevant problem parameters, but not the size of the action set.
    Condition monitoring and anomaly detection in cyber-physical systems. (arXiv:2301.09030v1 [cs.LG])
    The modern industrial environment is equipping myriads of smart manufacturing machines where the state of each device can be monitored continuously. Such monitoring can help identify possible future failures and develop a cost-effective maintenance plan. However, it is a daunting task to perform early detection with low false positives and negatives from the huge volume of collected data. This requires developing a holistic machine learning framework to address the issues in condition monitoring of high-priority components and develop efficient techniques to detect anomalies that can detect and possibly localize the faulty components. This paper presents a comparative analysis of recent machine learning approaches for robust, cost-effective anomaly detection in cyber-physical systems. While detection has been extensively studied, very few researchers have analyzed the localization of the anomalies. We show that supervised learning outperforms unsupervised algorithms. For supervised cases, we achieve near-perfect accuracy of 98 percent (specifically for tree-based algorithms). In contrast, the best-case accuracy in the unsupervised cases was 63 percent :the area under the receiver operating characteristic curve (AUC) exhibits similar outcomes as an additional metric.
    Combined Use of Federated Learning and Image Encryption for Privacy-Preserving Image Classification with Vision Transformer. (arXiv:2301.09255v1 [cs.CV])
    In recent years, privacy-preserving methods for deep learning have become an urgent problem. Accordingly, we propose the combined use of federated learning (FL) and encrypted images for privacy-preserving image classification under the use of the vision transformer (ViT). The proposed method allows us not only to train models over multiple participants without directly sharing their raw data but to also protect the privacy of test (query) images for the first time. In addition, it can also maintain the same accuracy as normally trained models. In an experiment, the proposed method was demonstrated to well work without any performance degradation on the CIFAR-10 and CIFAR-100 datasets.
    Statistically Optimal Robust Mean and Covariance Estimation for Anisotropic Gaussians. (arXiv:2301.09024v1 [math.ST])
    Assume that $X_{1}, \ldots, X_{N}$ is an $\varepsilon$-contaminated sample of $N$ independent Gaussian vectors in $\mathbb{R}^d$ with mean $\mu$ and covariance $\Sigma$. In the strong $\varepsilon$-contamination model we assume that the adversary replaced an $\varepsilon$ fraction of vectors in the original Gaussian sample by any other vectors. We show that there is an estimator $\widehat \mu$ of the mean satisfying, with probability at least $1 - \delta$, a bound of the form \[ \|\widehat{\mu} - \mu\|_2 \le c\left(\sqrt{\frac{\operatorname{Tr}(\Sigma)}{N}} + \sqrt{\frac{\|\Sigma\|\log(1/\delta)}{N}} + \varepsilon\sqrt{\|\Sigma\|}\right), \] where $c > 0$ is an absolute constant and $\|\Sigma\|$ denotes the operator norm of $\Sigma$. In the same contaminated Gaussian setup, we construct an estimator $\widehat \Sigma$ of the covariance matrix $\Sigma$ that satisfies, with probability at least $1 - \delta$, \[ \left\|\widehat{\Sigma} - \Sigma\right\| \le c\left(\sqrt{\frac{\|\Sigma\|\operatorname{Tr}(\Sigma)}{N}} + \|\Sigma\|\sqrt{\frac{\log(1/\delta)}{N}} + \varepsilon\|\Sigma\|\right). \] Both results are optimal up to multiplicative constant factors. Despite the recent significant interest in robust statistics, achieving both dimension-free bounds in the canonical Gaussian case remained open. In fact, several previously known results were either dimension-dependent and required $\Sigma$ to be close to identity, or had a sub-optimal dependence on the contamination level $\varepsilon$. As a part of the analysis, we derive sharp concentration inequalities for central order statistics of Gaussian, folded normal, and chi-squared distributions.
    Self Reward Design with Fine-grained Interpretability. (arXiv:2112.15034v3 [cs.LG] UPDATED)
    The black-box nature of deep neural networks (DNN) has brought to attention the issues of transparency and fairness. Deep Reinforcement Learning (Deep RL or DRL), which uses DNN to learn its policy, value functions etc, is thus also subject to similar concerns. This paper proposes a way to circumvent the issues through the bottom-up design of neural networks with detailed interpretability, where each neuron or layer has its own meaning and utility that corresponds to humanly understandable concept. The framework introduced in this paper is called the Self Reward Design (SRD), inspired by the Inverse Reward Design, and this interpretable design can (1) solve the problem by pure design (although imperfectly) and (2) be optimized like a standard DNN. With deliberate human designs, we show that some RL problems such as lavaland and MuJoCo can be solved using a model constructed with standard NN components with few parameters. Furthermore, with our fish sale auction example, we demonstrate how SRD is used to address situations that will not make sense if black-box models are used, where humanly-understandable semantic-based decision is required.
    Unifying Synergies between Self-supervised Learning and Dynamic Computation. (arXiv:2301.09164v1 [cs.LG])
    Self-supervised learning (SSL) approaches have made major strides forward by emulating the performance of their supervised counterparts on several computer vision benchmarks. This, however, comes at a cost of substantially larger model sizes, and computationally expensive training strategies, which eventually lead to larger inference times making it impractical for resource constrained industrial settings. Techniques like knowledge distillation (KD), dynamic computation (DC), and pruning are often used to obtain a lightweight sub-network, which usually involves multiple epochs of fine-tuning of a large pre-trained model, making it more computationally challenging. In this work we propose a novel perspective on the interplay between SSL and DC paradigms that can be leveraged to simultaneously learn a dense and gated (sparse/lightweight) sub-network from scratch offering a good accuracy-efficiency trade-off, and therefore yielding a generic and multi-purpose architecture for application specific industrial settings. Our study overall conveys a constructive message: exhaustive experiments on several image classification benchmarks: CIFAR-10, STL-10, CIFAR-100, and ImageNet-100, demonstrates that the proposed training strategy provides a dense and corresponding sparse sub-network that achieves comparable (on-par) performance compared with the vanilla self-supervised setting, but at a significant reduction in computation in terms of FLOPs under a range of target budgets.
    Raw or Cooked? Object Detection on RAW Images. (arXiv:2301.08965v1 [cs.CV])
    Images fed to a deep neural network have in general undergone several handcrafted image signal processing (ISP) operations, all of which have been optimized to produce visually pleasing images. In this work, we investigate the hypothesis that the intermediate representation of visually pleasing images is sub-optimal for downstream computer vision tasks compared to the RAW image representation. We suggest that the operations of the ISP instead should be optimized towards the end task, by learning the parameters of the operations jointly during training. We extend previous works on this topic and propose a new learnable operation that enables an object detector to achieve superior performance when compared to both previous works and traditional RGB images. In experiments on the open PASCALRAW dataset, we empirically confirm our hypothesis.
    Self-Supervised Image Representation Learning: Transcending Masking with Paired Image Overlay. (arXiv:2301.09299v1 [cs.CV])
    Self-supervised learning has become a popular approach in recent years for its ability to learn meaningful representations without the need for data annotation. This paper proposes a novel image augmentation technique, overlaying images, which has not been widely applied in self-supervised learning. This method is designed to provide better guidance for the model to understand underlying information, resulting in more useful representations. The proposed method is evaluated using contrastive learning, a widely used self-supervised learning method that has shown solid performance in downstream tasks. The results demonstrate the effectiveness of the proposed augmentation technique in improving the performance of self-supervised models.
    Improving Deep Neural Network Classification Confidence using Heatmap-based eXplainable AI. (arXiv:2201.00009v3 [cs.LG] UPDATED)
    This paper quantifies the quality of heatmap-based eXplainable AI (XAI) methods w.r.t image classification problem. Here, a heatmap is considered desirable if it improves the probability of predicting the correct classes. Different XAI heatmap-based methods are empirically shown to improve classification confidence to different extents depending on the datasets, e.g. Saliency works best on ImageNet and Deconvolution on Chest X-Ray Pneumonia dataset. The novelty includes a new gap distribution that shows a stark difference between correct and wrong predictions. Finally, the generative augmentative explanation is introduced, a method to generate heatmaps capable of improving predictive confidence to a high level.
    Adapting a Language Model While Preserving its General Knowledge. (arXiv:2301.08986v1 [cs.CL])
    Domain-adaptive pre-training (or DA-training for short), also known as post-training, aims to train a pre-trained general-purpose language model (LM) using an unlabeled corpus of a particular domain to adapt the LM so that end-tasks in the domain can give improved performances. However, existing DA-training methods are in some sense blind as they do not explicitly identify what knowledge in the LM should be preserved and what should be changed by the domain corpus. This paper shows that the existing methods are suboptimal and proposes a novel method to perform a more informed adaptation of the knowledge in the LM by (1) soft-masking the attention heads based on their importance to best preserve the general knowledge in the LM and (2) contrasting the representations of the general and the full (both general and domain knowledge) to learn an integrated representation with both general and domain-specific knowledge. Experimental results will demonstrate the effectiveness of the proposed approach.
    The Shape of Explanations: A Topological Account of Rule-Based Explanations in Machine Learning. (arXiv:2301.09042v1 [cs.LG])
    Rule-based explanations provide simple reasons explaining the behavior of machine learning classifiers at given points in the feature space. Several recent methods (Anchors, LORE, etc.) purport to generate rule-based explanations for arbitrary or black-box classifiers. But what makes these methods work in general? We introduce a topological framework for rule-based explanation methods and provide a characterization of explainability in terms of the definability of a classifier relative to an explanation scheme. We employ this framework to consider various explanation schemes and argue that the preferred scheme depends on how much the user knows about the domain and the probability measure over the feature space.
    Efficient Training Under Limited Resources. (arXiv:2301.09264v1 [cs.LG])
    Training time budget and size of the dataset are among the factors affecting the performance of a Deep Neural Network (DNN). This paper shows that Neural Architecture Search (NAS), Hyper Parameters Optimization (HPO), and Data Augmentation help DNNs perform much better while these two factors are limited. However, searching for an optimal architecture and the best hyperparameter values besides a good combination of data augmentation techniques under low resources requires many experiments. We present our approach to achieving such a goal in three steps: reducing training epoch time by compressing the model while maintaining the performance compared to the original model, preventing model overfitting when the dataset is small, and performing the hyperparameter tuning. We used NOMAD, which is a blackbox optimization software based on a derivative-free algorithm to do NAS and HPO. Our work achieved an accuracy of 86.0 % on a tiny subset of Mini-ImageNet at the ICLR 2021 Hardware Aware Efficient Training (HAET) Challenge and won second place in the competition. The competition results can be found at haet2021.github.io/challenge and our source code can be found at github.com/DouniaLakhmiri/ICLR\_HAET2021.
    Pre-text Representation Transfer for Deep Learning with Limited Imbalanced Data : Application to CT-based COVID-19 Detection. (arXiv:2301.08888v1 [eess.IV])
    Annotating medical images for disease detection is often tedious and expensive. Moreover, the available training samples for a given task are generally scarce and imbalanced. These conditions are not conducive for learning effective deep neural models. Hence, it is common to 'transfer' neural networks trained on natural images to the medical image domain. However, this paradigm lacks in performance due to the large domain gap between the natural and medical image data. To address that, we propose a novel concept of Pre-text Representation Transfer (PRT). In contrast to the conventional transfer learning, which fine-tunes a source model after replacing its classification layers, PRT retains the original classification layers and updates the representation layers through an unsupervised pre-text task. The task is performed with (original, not synthetic) medical images, without utilizing any annotations. This enables representation transfer with a large amount of training data. This high-fidelity representation transfer allows us to use the resulting model as a more effective feature extractor. Moreover, we can also subsequently perform the traditional transfer learning with this model. We devise a collaborative representation based classification layer for the case when we leverage the model as a feature extractor. We fuse the output of this layer with the predictions of a model induced with the traditional transfer learning performed over our pre-text transferred model. The utility of our technique for limited and imbalanced data classification problem is demonstrated with an extensive five-fold evaluation for three large-scale models, tested for five different class-imbalance ratios for CT based COVID-19 detection. Our results show a consistent gain over the conventional transfer learning with the proposed method.
    Spatial Attention Kinetic Networks with E(n)-Equivariance. (arXiv:2301.08893v1 [cs.LG])
    Neural networks that are equivariant to rotations, translations, reflections, and permutations on n-dimensional geometric space have shown promise in physical modeling for tasks such as accurately but inexpensively modeling complex potential energy surfaces to guiding the sampling of complex dynamical systems or forecasting their time evolution. Current state-of-the-art methods employ spherical harmonics to encode higher-order interactions among particles, which are computationally expensive. In this paper, we propose a simple alternative functional form that uses neurally parametrized linear combinations of edge vectors to achieve equivariance while still universally approximating node environments. Incorporating this insight, we design spatial attention kinetic networks with E(n)-equivariance, or SAKE, which are competitive in many-body system modeling tasks while being significantly faster.
    The Best of Both Worlds: Accurate Global and Personalized Models through Federated Learning with Data-Free Hyper-Knowledge Distillation. (arXiv:2301.08968v1 [cs.LG])
    Heterogeneity of data distributed across clients limits the performance of global models trained through federated learning, especially in the settings with highly imbalanced class distributions of local datasets. In recent years, personalized federated learning (pFL) has emerged as a potential solution to the challenges presented by heterogeneous data. However, existing pFL methods typically enhance performance of local models at the expense of the global model's accuracy. We propose FedHKD (Federated Hyper-Knowledge Distillation), a novel FL algorithm in which clients rely on knowledge distillation (KD) to train local models. In particular, each client extracts and sends to the server the means of local data representations and the corresponding soft predictions -- information that we refer to as ``hyper-knowledge". The server aggregates this information and broadcasts it to the clients in support of local training. Notably, unlike other KD-based pFL methods, FedHKD does not rely on a public dataset nor it deploys a generative model at the server. We analyze convergence of FedHKD and conduct extensive experiments on visual datasets in a variety of scenarios, demonstrating that FedHKD provides significant improvement in both personalized as well as global model performance compared to state-of-the-art FL methods designed for heterogeneous data settings.
    The Conditional Cauchy-Schwarz Divergence with Applications to Time-Series Data and Sequential Decision Making. (arXiv:2301.08970v1 [cs.LG])
    The Cauchy-Schwarz (CS) divergence was developed by Pr\'{i}ncipe et al. in 2000. In this paper, we extend the classic CS divergence to quantify the closeness between two conditional distributions and show that the developed conditional CS divergence can be simply estimated by a kernel density estimator from given samples. We illustrate the advantages (e.g., the rigorous faithfulness guarantee, the lower computational complexity, the higher statistical power, and the much more flexibility in a wide range of applications) of our conditional CS divergence over previous proposals, such as the conditional KL divergence and the conditional maximum mean discrepancy. We also demonstrate the compelling performance of conditional CS divergence in two machine learning tasks related to time series data and sequential inference, namely the time series clustering and the uncertainty-guided exploration for sequential decision making.
    Classification of Luminal Subtypes in Full Mammogram Images Using Transfer Learning. (arXiv:2301.09282v1 [eess.IV])
    Automatic identification of patients with luminal and non-luminal subtypes during a routine mammography screening can support clinicians in streamlining breast cancer therapy planning. Recent machine learning techniques have shown promising results in molecular subtype classification in mammography; however, they are highly dependent on pixel-level annotations, handcrafted, and radiomic features. In this work, we provide initial insights into the luminal subtype classification in full mammogram images trained using only image-level labels. Transfer learning is applied from a breast abnormality classification task, to finetune a ResNet-18-based luminal versus non-luminal subtype classification task. We present and compare our results on the publicly available CMMD dataset and show that our approach significantly outperforms the baseline classifier by achieving a mean AUC score of 0.6688 and a mean F1 score of 0.6693 on the test dataset. The improvement over baseline is statistically significant, with a p-value of p<0.0001.
    A Semantic Modular Framework for Events Topic Modeling in Social Media. (arXiv:2301.09009v1 [cs.LG])
    The advancement of social media contributes to the growing amount of content they share frequently. This framework provides a sophisticated place for people to report various real-life events. Detecting these events with the help of natural language processing has received researchers' attention, and various algorithms have been developed for this goal. In this paper, we propose a Semantic Modular Model (SMM) consisting of 5 different modules, namely Distributional Denoising Autoencoder, Incremental Clustering, Semantic Denoising, Defragmentation, and Ranking and Processing. The proposed model aims to (1) cluster various documents and ignore the documents that might not contribute to the identification of events, (2) identify more important and descriptive keywords. Compared to the state-of-the-art methods, the results show that the proposed model has a higher performance in identifying events with lower ranks and extracting keywords for more important events in three English Twitter datasets: FACup, SuperTuesday, and USElection. The proposed method outperformed the best reported results in the mean keyword-precision metric by 7.9\%.
    Leveraging Speaker Embeddings with Adversarial Multi-task Learning for Age Group Classification. (arXiv:2301.09058v1 [eess.AS])
    Recently, researchers have utilized neural network-based speaker embedding techniques in speaker-recognition tasks to identify speakers accurately. However, speaker-discriminative embeddings do not always represent speech features such as age group well. In an embedding model that has been highly trained to capture speaker traits, the task of age group classification is closer to speech information leakage. Hence, to improve age group classification performance, we consider the use of speaker-discriminative embeddings derived from adversarial multi-task learning to align features and reduce the domain discrepancy in age subgroups. In addition, we investigated different types of speaker embeddings to learn and generalize the domain-invariant representations for age groups. Experimental results on the VoxCeleb Enrichment dataset verify the effectiveness of our proposed adaptive adversarial network in multi-objective scenarios and leveraging speaker embeddings for the domain adaptation task.
    Design-based individual prediction. (arXiv:2301.09117v1 [stat.ML])
    A design-based individual prediction approach is developed based on the expected cross-validation results, given the sampling design and the sample-splitting design for cross-validation. Whether the predictor is selected from an ensemble of models or a weighted average of them, valid inference of the unobserved prediction errors is defined and obtained with respect to the sampling design, while outcomes and features are treated as constants.
    Is Nash Equilibrium Approximator Learnable?. (arXiv:2108.07472v5 [cs.GT] UPDATED)
    In this paper, we investigate the learnability of the function approximator that approximates Nash equilibrium (NE) for games generated from a distribution. First, we offer a generalization bound using the Probably Approximately Correct (PAC) learning model. The bound describes the gap between the expected loss and empirical loss of the NE approximator. Afterward, we prove the agnostic PAC learnability of the Nash approximator. In addition to theoretical analysis, we demonstrate an application of NE approximator in experiments. The trained NE approximator can be used to warm-start and accelerate classical NE solvers. Together, our results show the practicability of approximating NE through function approximation.
    Regeneration Learning: A Learning Paradigm for Data Generation. (arXiv:2301.08846v1 [cs.LG])
    Machine learning methods for conditional data generation usually build a mapping from source conditional data X to target data Y. The target Y (e.g., text, speech, music, image, video) is usually high-dimensional and complex, and contains information that does not exist in source data, which hinders effective and efficient learning on the source-target mapping. In this paper, we present a learning paradigm called regeneration learning for data generation, which first generates Y' (an abstraction/representation of Y) from X and then generates Y from Y'. During training, Y' is obtained from Y through either handcrafted rules or self-supervised learning and is used to learn X-->Y' and Y'-->Y. Regeneration learning extends the concept of representation learning to data generation tasks, and can be regarded as a counterpart of traditional representation learning, since 1) regeneration learning handles the abstraction (Y') of the target data Y for data generation while traditional representation learning handles the abstraction (X') of source data X for data understanding; 2) both the processes of Y'-->Y in regeneration learning and X-->X' in representation learning can be learned in a self-supervised way (e.g., pre-training); 3) both the mappings from X to Y' in regeneration learning and from X' to Y in representation learning are simpler than the direct mapping from X to Y. We show that regeneration learning can be a widely-used paradigm for data generation (e.g., text generation, speech recognition, speech synthesis, music composition, image generation, and video generation) and can provide valuable insights into developing data generation methods.
    Debiasing the Cloze Task in Sequential Recommendation with Bidirectional Transformers. (arXiv:2301.09210v1 [cs.LG])
    Bidirectional Transformer architectures are state-of-the-art sequential recommendation models that use a bi-directional representation capacity based on the Cloze task, a.k.a. Masked Language Modeling. The latter aims to predict randomly masked items within the sequence. Because they assume that the true interacted item is the most relevant one, an exposure bias results, where non-interacted items with low exposure propensities are assumed to be irrelevant. The most common approach to mitigating exposure bias in recommendation has been Inverse Propensity Scoring (IPS), which consists of down-weighting the interacted predictions in the loss function in proportion to their propensities of exposure, yielding a theoretically unbiased learning. In this work, we argue and prove that IPS does not extend to sequential recommendation because it fails to account for the temporal nature of the problem. We then propose a novel propensity scoring mechanism, which can theoretically debias the Cloze task in sequential recommendation. Finally we empirically demonstrate the debiasing capabilities of our proposed approach and its robustness to the severity of exposure bias.
    Probabilistic Surrogate Networks for Simulators with Unbounded Randomness. (arXiv:1910.11950v3 [cs.LG] UPDATED)
    We present a framework for automatically structuring and training fast, approximate, deep neural surrogates of stochastic simulators. Unlike traditional approaches to surrogate modeling, our surrogates retain the interpretable structure and control flow of the reference simulator. Our surrogates target stochastic simulators where the number of random variables itself can be stochastic and potentially unbounded. Our framework further enables an automatic replacement of the reference simulator with the surrogate when undertaking amortized inference. The fidelity and speed of our surrogates allow for both faster stochastic simulation and accurate and substantially faster posterior inference. Using an illustrative yet non-trivial example we show our surrogates' ability to accurately model a probabilistic program with an unbounded number of random variables. We then proceed with an example that shows our surrogates are able to accurately model a complex structure like an unbounded stack in a program synthesis example. We further demonstrate how our surrogate modeling technique makes amortized inference in complex black-box simulators an order of magnitude faster. Specifically, we do simulator-based materials quality testing, inferring safety-critical latent internal temperature profiles of composite materials undergoing curing.
    DeepFEL: Deep Fastfood Ensemble Learning for Histopathology Image Analysis. (arXiv:2301.09525v1 [eess.IV])
    Computational pathology tasks have some unique characterises such as multi-gigapixel images, tedious and frequently uncertain annotations, and unavailability of large number of cases [13]. To address some of these issues, we present Deep Fastfood Ensembles - a simple, fast and yet effective method for combining deep features pooled from popular CNN models pre-trained on totally different source domains (e.g., natural image objects) and projected onto diverse dimensions using random projections, the so-called Fastfood [11]. The final ensemble output is obtained by a consensus of simple individual classifiers, each of which is trained on a different collection of random basis vectors. This offers extremely fast and yet effective solution, especially when training times and domain labels are of the essence. We demonstrate the effectiveness of the proposed deep fastfood ensemble learning as compared to the state-of-the-art methods for three different tasks in histopathology image analysis.
    Accelerating Fair Federated Learning: Adaptive Federated Adam. (arXiv:2301.09357v1 [cs.LG])
    Federated learning is a distributed and privacy-preserving approach to train a statistical model collaboratively from decentralized data of different parties. However, when datasets of participants are not independent and identically distributed (non-IID), models trained by naive federated algorithms may be biased towards certain participants, and model performance across participants is non-uniform. This is known as the fairness problem in federated learning. In this paper, we formulate fairness-controlled federated learning as a dynamical multi-objective optimization problem to ensure fair performance across all participants. To solve the problem efficiently, we study the convergence and bias of Adam as the server optimizer in federated learning, and propose Adaptive Federated Adam (AdaFedAdam) to accelerate fair federated learning with alleviated bias. We validated the effectiveness, Pareto optimality and robustness of AdaFedAdam in numerical experiments and show that AdaFedAdam outperforms existing algorithms, providing better convergence and fairness properties of the federated scheme.
    A Structural Approach to the Design of Domain Specific Neural Network Architectures. (arXiv:2301.09381v1 [cs.LG])
    This is a master's thesis concerning the theoretical ideas of geometric deep learning. Geometric deep learning aims to provide a structured characterization of neural network architectures, specifically focused on the ideas of invariance and equivariance of data with respect to given transformations. This thesis aims to provide a theoretical evaluation of geometric deep learning, compiling theoretical results that characterize the properties of invariant neural networks with respect to learning performance.
    HALOC: Hardware-Aware Automatic Low-Rank Compression for Compact Neural Networks. (arXiv:2301.09422v1 [cs.LG])
    Low-rank compression is an important model compression strategy for obtaining compact neural network models. In general, because the rank values directly determine the model complexity and model accuracy, proper selection of layer-wise rank is very critical and desired. To date, though many low-rank compression approaches, either selecting the ranks in a manual or automatic way, have been proposed, they suffer from costly manual trials or unsatisfied compression performance. In addition, all of the existing works are not designed in a hardware-aware way, limiting the practical performance of the compressed models on real-world hardware platforms. To address these challenges, in this paper we propose HALOC, a hardware-aware automatic low-rank compression framework. By interpreting automatic rank selection from an architecture search perspective, we develop an end-to-end solution to determine the suitable layer-wise ranks in a differentiable and hardware-aware way. We further propose design principles and mitigation strategy to efficiently explore the rank space and reduce the potential interference problem. Experimental results on different datasets and hardware platforms demonstrate the effectiveness of our proposed approach. On CIFAR-10 dataset, HALOC enables 0.07% and 0.38% accuracy increase over the uncompressed ResNet-20 and VGG-16 models with 72.20% and 86.44% fewer FLOPs, respectively. On ImageNet dataset, HALOC achieves 0.9% higher top-1 accuracy than the original ResNet-18 model with 66.16% fewer FLOPs. HALOC also shows 0.66% higher top-1 accuracy increase than the state-of-the-art automatic low-rank compression solution with fewer computational and memory costs. In addition, HALOC demonstrates the practical speedups on different hardware platforms, verified by the measurement results on desktop GPU, embedded GPU and ASIC accelerator.
    StockEmotions: Discover Investor Emotions for Financial Sentiment Analysis and Multivariate Time Series. (arXiv:2301.09279v1 [cs.CL])
    There has been growing interest in applying NLP techniques in the financial domain, however, resources are extremely limited. This paper introduces StockEmotions, a new dataset for detecting emotions in the stock market that consists of 10,000 English comments collected from StockTwits, a financial social media platform. Inspired by behavioral finance, it proposes 12 fine-grained emotion classes that span the roller coaster of investor emotion. Unlike existing financial sentiment datasets, StockEmotions presents granular features such as investor sentiment classes, fine-grained emotions, emojis, and time series data. To demonstrate the usability of the dataset, we perform a dataset analysis and conduct experimental downstream tasks. For financial sentiment/emotion classification tasks, DistilBERT outperforms other baselines, and for multivariate time series forecasting, a Temporal Attention LSTM model combining price index, text, and emotion features achieves the best performance than using a single feature.
    Logical Message Passing Networks with One-hop Inference on Atomic Formulas. (arXiv:2301.08859v1 [cs.LG])
    Complex Query Answering (CQA) over Knowledge Graphs (KGs) has attracted a lot of attention to potentially support many applications. Given that KGs are usually incomplete, neural models are proposed to answer logical queries by parameterizing set operators with complex neural networks. However, such methods usually train neural set operators with a large number of entity and relation embeddings from zero, where whether and how the embeddings or the neural set operators contribute to the performance remains not clear. In this paper, we propose a simple framework for complex query answering that decomposes the KG embeddings from neural set operators. We propose to represent the complex queries in the query graph. On top of the query graph, we propose the Logical Message Passing Neural Network (LMPNN) that connects the \textit{local} one-hop inferences on atomic formulas to the \textit{global} logical reasoning for complex query answering. We leverage existing effective KG embeddings to conduct one-hop inferences on atomic formulas, the results of which are regarded as the messages passed in LMPNN. The reasoning process over the overall logical formulas is turned into the forward pass of LMPNN that incrementally aggregates local information to predict the answers' embeddings finally. The complex logical inference across different types of queries will then be learned from training examples based on the LMPNN architecture. Theoretically, our query-graph representation is more general than the prevailing operator-tree formulation, so our approach applies to a broader range of complex KG queries. Empirically, our approach yields a new state-of-the-art neural CQA model. Our research bridges the gap between complex KG query answering tasks and the long-standing achievements of knowledge graph representation learning.  ( 2 min )
    Impact of PCA-based preprocessing and different CNN structures on deformable registration of sonograms. (arXiv:2301.08802v1 [cs.CV])
    Central venous catheters (CVC) are commonly inserted into the large veins of the neck, e.g. the internal jugular vein (IJV). CVC insertion may cause serious complications like misplacement into an artery or perforation of cervical vessels. Placing a CVC under sonographic guidance is an appropriate method to reduce such adverse events, if anatomical landmarks like venous and arterial vessels can be detected reliably. This task shall be solved by registration of patient individual images vs. an anatomically labelled reference image. In this work, a linear, affine transformation is performed on cervical sonograms, followed by a non-linear transformation to achieve a more precise registration. Voxelmorph (VM), a learning-based library for deformable image registration using a convolutional neural network (CNN) with U-Net structure was used for non-linear transformation. The impact of principal component analysis (PCA)-based pre-denoising of patient individual images, as well as the impact of modified net structures with differing complexities on registration results were examined visually and quantitatively, the latter using metrics for deformation and image similarity. Using the PCA-approximated cervical sonograms resulted in decreased mean deformation lengths between 18% and 66% compared to their original image counterparts, depending on net structure. In addition, reducing the number of convolutional layers led to improved image similarity with PCA images, while worsening in original images. Despite a large reduction of network parameters, no overall decrease in registration quality was observed, leading to the conclusion that the original net structure is oversized for the task at hand.  ( 2 min )
    Characterization and Learning of Causal Graphs with Small Conditioning Sets. (arXiv:2301.09028v1 [cs.AI])
    Constraint-based causal discovery algorithms learn part of the causal graph structure by systematically testing conditional independences observed in the data. These algorithms, such as the PC algorithm and its variants, rely on graphical characterizations of the so-called equivalence class of causal graphs proposed by Pearl. However, constraint-based causal discovery algorithms struggle when data is limited since conditional independence tests quickly lose their statistical power, especially when the conditioning set is large. To address this, we propose using conditional independence tests where the size of the conditioning set is upper bounded by some integer $k$ for robust causal discovery. The existing graphical characterizations of the equivalence classes of causal graphs are not applicable when we cannot leverage all the conditional independence statements. We first define the notion of $k$-Markov equivalence: Two causal graphs are $k$-Markov equivalent if they entail the same conditional independence constraints where the conditioning set size is upper bounded by $k$. We propose a novel representation that allows us to graphically characterize $k$-Markov equivalence between two causal graphs. We propose a sound constraint-based algorithm called the $k$-PC algorithm for learning this equivalence class. Finally, we conduct synthetic, and semi-synthetic experiments to demonstrate that the $k$-PC algorithm enables more robust causal discovery in the small sample regime compared to the baseline PC algorithm.  ( 2 min )
    Fast likelihood-based change point detection. (arXiv:2301.08892v1 [cs.LG])
    Change point detection plays a fundamental role in many real-world applications, where the goal is to analyze and monitor the behaviour of a data stream. In this paper, we study change detection in binary streams. To this end, we use a likelihood ratio between two models as a measure for indicating change. The first model is a single bernoulli variable while the second model divides the stored data in two segments, and models each segment with its own bernoulli variable. Finding the optimal split can be done in $O(n)$ time, where $n$ is the number of entries since the last change point. This is too expensive for large $n$. To combat this we propose an approximation scheme that yields $(1 - \epsilon)$ approximation in $O(\epsilon^{-1} \log^2 n)$ time. The speed-up consists of several steps: First we reduce the number of possible candidates by adopting a known result from segmentation problems. We then show that for fixed bernoulli parameters we can find the optimal change point in logarithmic time. Finally, we show how to construct a candidate list of size $O(\epsilon^{-1} \log n)$ for model parameters. We demonstrate empirically the approximation quality and the running time of our algorithm, showing that we can gain a significant speed-up with a minimal average loss in optimality.  ( 2 min )
    Quasi-optimal Learning with Continuous Treatments. (arXiv:2301.08940v1 [stat.ML])
    Many real-world applications of reinforcement learning (RL) require making decisions in continuous action environments. In particular, determining the optimal dose level plays a vital role in developing medical treatment regimes. One challenge in adapting existing RL algorithms to medical applications, however, is that the popular infinite support stochastic policies, e.g., Gaussian policy, may assign riskily high dosages and harm patients seriously. Hence, it is important to induce a policy class whose support only contains near-optimal actions, and shrink the action-searching area for effectiveness and reliability. To achieve this, we develop a novel \emph{quasi-optimal learning algorithm}, which can be easily optimized in off-policy settings with guaranteed convergence under general function approximations. Theoretically, we analyze the consistency, sample complexity, adaptability, and convergence of the proposed algorithm. We evaluate our algorithm with comprehensive simulated experiments and a dose suggestion real application to Ohio Type 1 diabetes dataset.  ( 2 min )
    Improving Deep Regression with Ordinal Entropy. (arXiv:2301.08915v1 [cs.CV])
    In computer vision, it is often observed that formulating regression problems as a classification task often yields better performance. We investigate this curious phenomenon and provide a derivation to show that classification, with the cross-entropy loss, outperforms regression with a mean squared error loss in its ability to learn high-entropy feature representations. Based on the analysis, we propose an ordinal entropy loss to encourage higher-entropy feature spaces while maintaining ordinal relationships to improve the performance of regression tasks. Experiments on synthetic and real-world regression tasks demonstrate the importance and benefits of increasing entropy for regression.  ( 2 min )
    Developing Hybrid Machine Learning Models to Assign Health Score to Railcar Fleets for Optimal Decision Making. (arXiv:2301.08877v1 [cs.LG])
    A large amount of data is generated during the operation of a railcar fleet, which can easily lead to dimensional disaster and reduce the resiliency of the railcar network. To solve these issues and offer predictive maintenance, this research introduces a hybrid fault diagnosis expert system method that combines density-based spatial clustering of applications with noise (DBSCAN) and principal component analysis (PCA). Firstly, the DBSCAN method is used to cluster categorical data that are similar to one another within the same group. Secondly, PCA algorithm is applied to reduce the dimensionality of the data and eliminate redundancy in order to improve the accuracy of fault diagnosis. Finally, we explain the engineered features and evaluate the selected models by using the Gain Chart and Area Under Curve (AUC) metrics. We use the hybrid expert system model to enhance maintenance planning decisions by assigning a health score to the railcar system of the North American Railcar Owner (NARO). According to the experimental results, our expert model can detect 96.4% of failures within 50% of the sample. This suggests that our method is effective at diagnosing failures in railcars fleet.  ( 2 min )
    Statistical Theory of Differentially Private Marginal-based Data Synthesis Algorithms. (arXiv:2301.08844v1 [cs.LG])
    Marginal-based methods achieve promising performance in the synthetic data competition hosted by the National Institute of Standards and Technology (NIST). To deal with high-dimensional data, the distribution of synthetic data is represented by a probabilistic graphical model (e.g., a Bayesian network), while the raw data distribution is approximated by a collection of low-dimensional marginals. Differential privacy (DP) is guaranteed by introducing random noise to each low-dimensional marginal distribution. Despite its promising performance in practice, the statistical properties of marginal-based methods are rarely studied in the literature. In this paper, we study DP data synthesis algorithms based on Bayesian networks (BN) from a statistical perspective. We establish a rigorous accuracy guarantee for BN-based algorithms, where the errors are measured by the total variation (TV) distance or the $L^2$ distance. Related to downstream machine learning tasks, an upper bound for the utility error of the DP synthetic data is also derived. To complete the picture, we establish a lower bound for TV accuracy that holds for every $\epsilon$-DP synthetic data generator.  ( 2 min )
    A Communication-Efficient Adaptive Algorithm for Federated Learning under Cumulative Regret. (arXiv:2301.08869v1 [cs.LG])
    We consider the problem of online stochastic optimization in a distributed setting with $M$ clients connected through a central server. We develop a distributed online learning algorithm that achieves order-optimal cumulative regret with low communication cost measured in the total number of bits transmitted over the entire learning horizon. This is in contrast to existing studies which focus on the offline measure of simple regret for learning efficiency. The holistic measure for communication cost also departs from the prevailing approach that \emph{separately} tackles the communication frequency and the number of bits in each communication round.  ( 2 min )
    Rationalization for Explainable NLP: A Survey. (arXiv:2301.08912v1 [cs.CL])
    Recent advances in deep learning have improved the performance of many Natural Language Processing (NLP) tasks such as translation, question-answering, and text classification. However, this improvement comes at the expense of model explainability. Black-box models make it difficult to understand the internals of a system and the process it takes to arrive at an output. Numerical (LIME, Shapley) and visualization (saliency heatmap) explainability techniques are helpful; however, they are insufficient because they require specialized knowledge. These factors led rationalization to emerge as a more accessible explainable technique in NLP. Rationalization justifies a model's output by providing a natural language explanation (rationale). Recent improvements in natural language generation have made rationalization an attractive technique because it is intuitive, human-comprehensible, and accessible to non-technical users. Since rationalization is a relatively new field, it is disorganized. As the first survey, rationalization literature in NLP from 2007-2022 is analyzed. This survey presents available methods, explainable evaluations, code, and datasets used across various NLP tasks that use rationalization. Further, a new subfield in Explainable AI (XAI), namely, Rational AI (RAI), is introduced to advance the current state of rationalization. A discussion on observed insights, challenges, and future directions is provided to point to promising research opportunities.  ( 2 min )
    Federated Recommendation with Additive Personalization. (arXiv:2301.09109v1 [cs.LG])
    With rising concerns about privacy, developing recommendation systems in a federated setting become a new paradigm to develop next-generation Internet service architecture. However, existing approaches are usually derived from a distributed recommendation framework with an additional mechanism for privacy protection, thus most of them fail to fully exploit personalization in the new context of federated recommendation settings. In this paper, we propose a novel approach called Federated Recommendation with Additive Personalization (FedRAP) to enhance recommendation by learning user embedding and the user's personal view of item embeddings. Specifically, the proposed additive personalization is to add a personalized item embedding to a sparse global item embedding aggregated from all users. Moreover, a curriculum learning mechanism has been applied for additive personalization on item embeddings by gradually increasing regularization weights to mitigate the performance degradation caused by large variances among client-specific item embeddings. A unified formulation has been proposed with a sparse regularization of global item embeddings for reducing communication overhead. Experimental results on four real-world recommendation datasets demonstrate the effectiveness of FedRAP.  ( 2 min )
    Slice Transformer and Self-supervised Learning for 6DoF Localization in 3D Point Cloud Maps. (arXiv:2301.08957v1 [cs.CV])
    Precise localization is critical for autonomous vehicles. We present a self-supervised learning method that employs Transformers for the first time for the task of outdoor localization using LiDAR data. We propose a pre-text task that reorganizes the slices of a $360^\circ$ LiDAR scan to leverage its axial properties. Our model, called Slice Transformer, employs multi-head attention while systematically processing the slices. To the best of our knowledge, this is the first instance of leveraging multi-head attention for outdoor point clouds. We additionally introduce the Perth-WA dataset, which provides a large-scale LiDAR map of Perth city in Western Australia, covering $\sim$4km$^2$ area. Localization annotations are provided for Perth-WA. The proposed localization method is thoroughly evaluated on Perth-WA and Appollo-SouthBay datasets. We also establish the efficacy of our self-supervised learning approach for the common downstream task of object classification using ModelNet40 and ScanNN datasets. The code and Perth-WA data will be publicly released.  ( 2 min )
    Estimation of Sea State Parameters from Ship Motion Responses Using Attention-based Neural Networks. (arXiv:2301.08949v1 [cs.LG])
    On-site estimation of sea state parameters is crucial for ship navigation systems' accuracy, stability, and efficiency. Extensive research has been conducted on model-based estimating methods utilizing only ship motion responses. Model-free approaches based on machine learning (ML) have recently gained popularity, and estimation from time-series of ship motion responses using deep learning (DL) methods has given promising results. Accordingly, in this study, we apply the novel, attention-based neural network (AT-NN) for estimating sea state parameters (wave height, zero-crossing period, and relative wave direction) from raw time-series data of ship pitch, heave, and roll motions. Despite using reduced input data, it has been successfully demonstrated that the proposed approaches by modified state-of-the-art techniques (based on convolutional neural networks (CNN) for regression, multivariate long short-term memory CNN, and sliding puzzle neural network) reduced estimation MSE by 23% and MAE by 16% compared to the original methods. Furthermore, the proposed technique based on AT-NN outperformed all tested methods (original and enhanced), reducing estimation MSE by up to 94% and MAE by up to 70%. Finally, we also proposed a novel approach for interpreting the uncertainty estimation of neural network outputs based on the Monte-Carlo dropout method to enhance the model's trustworthiness.  ( 2 min )
    Soft Sensing Regression Model: from Sensor to Wafer Metrology Forecasting. (arXiv:2301.08974v1 [cs.LG])
    The semiconductor industry is one of the most technology-evolving and capital-intensive market sectors. Effective inspection and metrology are necessary to improve product yield, increase product quality and reduce costs. In recent years, many semiconductor manufacturing equipments are equipped with sensors to facilitate real-time monitoring of the production process. These production-state and equipment-state sensor data provide an opportunity to practice machine-learning technologies in various domains, such as anomaly/fault detection, maintenance scheduling, quality prediction, etc. In this work, we focus on the task of soft sensing regression, which uses sensor data to predict impending inspection measurements that used to be measured in wafer inspection and metrology systems. We proposed an LSTM-based regressor and designed two loss functions for model training. Although engineers may look at our prediction errors in a subjective manner, a new piece-wise evaluation metric was proposed for assessing model accuracy in a mathematical way. The experimental results demonstrated that the proposed model can achieve accurate and early prediction of various types of inspections in complicated manufacturing processes.  ( 2 min )
    Bayesian Hierarchical Models for Counterfactual Estimation. (arXiv:2301.08833v1 [cs.LG])
    Counterfactual explanations utilize feature perturbations to analyze the outcome of an original decision and recommend an actionable recourse. We argue that it is beneficial to provide several alternative explanations rather than a single point solution and propose a probabilistic paradigm to estimate a diverse set of counterfactuals. Specifically, we treat the perturbations as random variables endowed with prior distribution functions. This allows sampling multiple counterfactuals from the posterior density, with the added benefit of incorporating inductive biases, preserving domain specific constraints and quantifying uncertainty in estimates. More importantly, we leverage Bayesian hierarchical modeling to share information across different subgroups of a population, which can both improve robustness and measure fairness. A gradient based sampler with superior convergence characteristics efficiently computes the posterior samples. Experiments across several datasets demonstrate that the counterfactuals estimated using our approach are valid, sparse, diverse and feasible.  ( 2 min )
    Dense RGB SLAM with Neural Implicit Maps. (arXiv:2301.08930v1 [cs.CV])
    There is an emerging trend of using neural implicit functions for map representation in Simultaneous Localization and Mapping (SLAM). Some pioneer works have achieved encouraging results on RGB-D SLAM. In this paper, we present a dense RGB SLAM method with neural implicit map representation. To reach this challenging goal without depth input, we introduce a hierarchical feature volume to facilitate the implicit map decoder. This design effectively fuses shape cues across different scales to facilitate map reconstruction. Our method simultaneously solves the camera motion and the neural implicit map by matching the rendered and input video frames. To facilitate optimization, we further propose a photometric warping loss in the spirit of multi-view stereo to better constrain the camera pose and scene geometry. We evaluate our method on commonly used benchmarks and compare it with modern RGB and RGB-D SLAM systems. Our method achieves favorable results than previous methods and even surpasses some recent RGB-D SLAM methods. Our source code will be publicly available.  ( 2 min )
    SPEC5G: A Dataset for 5G Cellular Network Protocol Analysis. (arXiv:2301.09201v1 [cs.IR])
    5G is the 5th generation cellular network protocol. It is the state-of-the-art global wireless standard that enables an advanced kind of network designed to connect virtually everyone and everything with increased speed and reduced latency. Therefore, its development, analysis, and security are critical. However, all approaches to the 5G protocol development and security analysis, e.g., property extraction, protocol summarization, and semantic analysis of the protocol specifications and implementations are completely manual. To reduce such manual effort, in this paper, we curate SPEC5G the first-ever public 5G dataset for NLP research. The dataset contains 3,547,586 sentences with 134M words, from 13094 cellular network specifications and 13 online websites. By leveraging large-scale pre-trained language models that have achieved state-of-the-art results on NLP tasks, we use this dataset for security-related text classification and summarization. Security-related text classification can be used to extract relevant security-related properties for protocol testing. On the other hand, summarization can help developers and practitioners understand the high level of the protocol, which is itself a daunting task. Our results show the value of our 5G-centric dataset in 5G protocol analysis automation. We believe that SPEC5G will enable a new research direction into automatic analyses for the 5G cellular network protocol and numerous related downstream tasks. Our data and code are publicly available.
    Is Signed Message Essential for Graph Neural Networks?. (arXiv:2301.08918v1 [cs.LG])
    Message-passing Graph Neural Networks (GNNs), which collect information from adjacent nodes, achieve satisfying results on homophilic graphs. However, their performances are dismal in heterophilous graphs, and many researchers have proposed a plethora of schemes to solve this problem. Especially, flipping the sign of edges is rooted in a strong theoretical foundation, and attains significant performance enhancements. Nonetheless, previous analyses assume a binary class scenario and they may suffer from confined applicability. This paper extends the prior understandings to multi-class scenarios and points out two drawbacks: (1) the sign of multi-hop neighbors depends on the message propagation paths and may incur inconsistency, (2) it also increases the prediction uncertainty (e.g., conflict evidence) which can impede the stability of the algorithm. Based on the theoretical understanding, we introduce a novel strategy that is applicable to multi-class graphs. The proposed scheme combines confidence calibration to secure robustness while reducing uncertainty. We show the efficacy of our theorem through extensive experiments on six benchmark graph datasets.  ( 2 min )
    On the Algebraic Properties of Flame Graphs. (arXiv:2301.08941v1 [cs.SE])
    Flame graphs are a popular way of representing profiling data. In this paper we propose a possible mathematical definition of flame graphs. In doing so, we gain some interesting algebraic properties almost for free, which in turn allow us to define some operations that can allow to perform an in-depth performance regression analysis. The typical documented use of a flame graph is via its graphical representation, whereby one scans the picture for the largest plateaux. Whilst this method is effective at finding the main sources of performance issues, it leaves quite a large amount of data potentially unused. By combining a mathematical precise definition of flame graphs with some statistical methods we show how to generalise this visual procedure and make the best of the full set of collected profiling data.  ( 2 min )
    ScaDLES: Scalable Deep Learning over Streaming data at the Edge. (arXiv:2301.08897v1 [cs.DC])
    Distributed deep learning (DDL) training systems are designed for cloud and data-center environments that assumes homogeneous compute resources, high network bandwidth, sufficient memory and storage, as well as independent and identically distributed (IID) data across all nodes. However, these assumptions don't necessarily apply on the edge, especially when training neural networks on streaming data in an online manner. Computing on the edge suffers from both systems and statistical heterogeneity. Systems heterogeneity is attributed to differences in compute resources and bandwidth specific to each device, while statistical heterogeneity comes from unbalanced and skewed data on the edge. Different streaming-rates among devices can be another source of heterogeneity when dealing with streaming data. If the streaming rate is lower than training batch-size, device needs to wait until enough samples have streamed in before performing a single iteration of stochastic gradient descent (SGD). Thus, low-volume streams act like stragglers slowing down devices with high-volume streams in synchronous training. On the other hand, data can accumulate quickly in the buffer if the streaming rate is too high and the devices can't train at line-rate. In this paper, we introduce ScaDLES to efficiently train on streaming data at the edge in an online fashion, while also addressing the challenges of limited bandwidth and training with non-IID data. We empirically show that ScaDLES converges up to 3.29 times faster compared to conventional distributed SGD.
    Cellular Network Speech Enhancement: Removing Background and Transmission Noise. (arXiv:2301.09027v1 [cs.SD])
    The primary objective of speech enhancement is to reduce background noise while preserving the target's speech. A common dilemma occurs when a speaker is confined to a noisy environment and receives a call with high background and transmission noise. To address this problem, the Deep Noise Suppression (DNS) Challenge focuses on removing the background noise with the next-generation deep learning models to enhance the target's speech; however, researchers fail to consider Voice Over IP (VoIP) applications their transmission noise. Focusing on Google Meet and its cellular application, our work achieves state-of-the-art performance on the Google Meet To Phone Track of the VoIP DNS Challenge. This paper demonstrates how to beat industrial performance and achieve 1.92 PESQ and 0.88 STOI, as well as superior acoustic fidelity, perceptual quality, and intelligibility in various metrics.
    Ti-MAE: Self-Supervised Masked Time Series Autoencoders. (arXiv:2301.08871v1 [cs.LG])
    Multivariate Time Series forecasting has been an increasingly popular topic in various applications and scenarios. Recently, contrastive learning and Transformer-based models have achieved good performance in many long-term series forecasting tasks. However, there are still several issues in existing methods. First, the training paradigm of contrastive learning and downstream prediction tasks are inconsistent, leading to inaccurate prediction results. Second, existing Transformer-based models which resort to similar patterns in historical time series data for predicting future values generally induce severe distribution shift problems, and do not fully leverage the sequence information compared to self-supervised methods. To address these issues, we propose a novel framework named Ti-MAE, in which the input time series are assumed to follow an integrate distribution. In detail, Ti-MAE randomly masks out embedded time series data and learns an autoencoder to reconstruct them at the point-level. Ti-MAE adopts mask modeling (rather than contrastive learning) as the auxiliary task and bridges the connection between existing representation learning and generative Transformer-based methods, reducing the difference between upstream and downstream forecasting tasks while maintaining the utilization of original time series data. Experiments on several public real-world datasets demonstrate that our framework of masked autoencoding could learn strong representations directly from the raw data, yielding better performance in time series forecasting and classification tasks.  ( 2 min )
    Tier Balancing: Towards Dynamic Fairness over Underlying Causal Factors. (arXiv:2301.08987v1 [cs.LG])
    The pursuit of long-term fairness involves the interplay between decision-making and the underlying data generating process. In this paper, through causal modeling with a directed acyclic graph (DAG) on the decision-distribution interplay, we investigate the possibility of achieving long-term fairness from a dynamic perspective. We propose Tier Balancing, a technically more challenging but more natural notion to achieve in the context of long-term, dynamic fairness analysis. Different from previous fairness notions that are defined purely on observed variables, our notion goes one step further, capturing behind-the-scenes situation changes on the unobserved latent causal factors that directly carry out the influence from the current decision to the future data distribution. Under the specified dynamics, we prove that in general one cannot achieve the long-term fairness goal only through one-step interventions. Furthermore, in the effort of approaching long-term fairness, we consider the mission of "getting closer to" the long-term fairness goal and present possibility and impossibility results accordingly.
    Versatile Neural Processes for Learning Implicit Neural Representations. (arXiv:2301.08883v1 [cs.LG])
    Representing a signal as a continuous function parameterized by neural network (a.k.a. Implicit Neural Representations, INRs) has attracted increasing attention in recent years. Neural Processes (NPs), which model the distributions over functions conditioned on partial observations (context set), provide a practical solution for fast inference of continuous functions. However, existing NP architectures suffer from inferior modeling capability for complex signals. In this paper, we propose an efficient NP framework dubbed Versatile Neural Processes (VNP), which largely increases the capability of approximating functions. Specifically, we introduce a bottleneck encoder that produces fewer and informative context tokens, relieving the high computational cost while providing high modeling capability. At the decoder side, we hierarchically learn multiple global latent variables that jointly model the global structure and the uncertainty of a function, enabling our model to capture the distribution of complex signals. We demonstrate the effectiveness of the proposed VNP on a variety of tasks involving 1D, 2D and 3D signals. Particularly, our method shows promise in learning accurate INRs w.r.t. a 3D scene without further finetuning.  ( 2 min )
    Geometry-Aware Supertagging with Heterogeneous Dynamic Convolutions. (arXiv:2203.12235v3 [cs.CL] UPDATED)
    The syntactic categories of categorial grammar formalisms are structured units made of smaller, indivisible primitives, bound together by the underlying grammar's category formation rules. In the trending approach of constructive supertagging, neural models are increasingly made aware of the internal category structure, which in turn enables them to more reliably predict rare and out-of-vocabulary categories, with significant implications for grammars previously deemed too complex to find practical use. In this work, we revisit constructive supertagging from a graph-theoretic perspective, and propose a framework based on heterogeneous dynamic graph convolutions aimed at exploiting the distinctive structure of a supertagger's output space. We test our approach on a number of categorial grammar datasets spanning different languages and grammar formalisms, achieving substantial improvements over previous state of the art scores. Code will be made available at https://github.com/konstantinosKokos/dynamic-graph-supertagging  ( 2 min )
    Comparing different subgradient methods for solving convex optimization problems with functional constraints. (arXiv:2101.01045v2 [math.OC] UPDATED)
    We consider the problem of minimizing a convex, nonsmooth function subject to a closed convex constraint domain. The methods that we propose are reforms of subgradient methods based on Metel--Takeda's paper [Optimization Letters 15.4 (2021): 1491-1504] and Boyd's works [Lecture notes of EE364b, Stanford University, Spring 2013-14, pp. 1-39]. While the former has complexity $\mathcal{O}(\varepsilon^{-2r})$ for all $r> 1$, the complexity of the latter is $\mathcal{O}(\varepsilon^{-2})$. We perform some comparisons between these two methods using several test examples.  ( 2 min )
    Problem-dependent attention and effort in neural networks with application to image resolution and model selection. (arXiv:2201.01415v3 [cs.CV] UPDATED)
    This paper introduces a new ensemble-based approach to reduce the data and computation costs of accurate classification. When faced with a new test case, a low cost classifier is used first, only moving to a higher cost approach if the initial classifier does not have a high degree of confidence in its projection. This multi-stage strategy can be used with any set of classifiers and does not require additional training. The approach is first applied to reduce the amount of data required to classify test images; it is found to be effective for problems in which at least some fraction of cases can be correctly classified based upon coarser data than are typically used. For neural networks performing digit recognition, for example, the proposed approach reduces the number of bytes of data read by 60% to 85% with less than 5% reduction in accuracy. For the ImageNet data, the number of bytes read by the typical network is reduced by 20% with less than 5% reduction in accuracy -- and in some cases, the resource savings reach 40%. The second application is to reduce computational complexity, with simpler neural networks used for test cases that are easier to classify and complex networks used for more difficult cases. For classification both of digits and of ImageNet images, computation cost is reduced by as much as 82% to 89% with less than 5% reduction in accuracy. The results also show that, for situations in which computational cost is not a concern, calculating multiple models' projections and selecting the one from the most confident classifier can increase classification accuracy on ImageNet by as much as two percent over the best standalone classifier considered here.  ( 3 min )
    Learning Interpretable Models Using an Oracle. (arXiv:1906.06852v5 [cs.LG] UPDATED)
    We look at a specific aspect of model interpretability: models often need to be constrained in size for them to be considered interpretable. But smaller models also tend to have high bias. This suggests a trade-off between interpretability and accuracy. Our work addresses this by: (a) showing that learning a training distribution (often different from the test distribution) can often increase accuracy of small models, and therefore may be used as a strategy to compensate for small sizes, and (b) providing a model-agnostic algorithm to learn such training distributions. We pose the distribution learning problem as one of optimizing parameters for an Infinite Beta Mixture Model based on a Dirichlet Process, so that the held-out accuracy of a model trained on a sample from this distribution is maximized. To make computation tractable, we project the training data onto one dimension: prediction uncertainty scores as provided by a highly accurate oracle model. A Bayesian Optimizer is used for learning the parameters. Empirical results using multiple real world datasets, various oracles and interpretable models with different notions of model sizes, are presented. We observe significant relative improvements in the F1-score in most cases, occasionally seeing improvements greater than 100% over baselines. Additionally we show that the proposed algorithm provides the following benefits: (a) its a framework which allows for flexibility in implementation, (b) it can be used across feature spaces, e.g., the text classification accuracy of a Decision Tree using character n-grams is shown to improve when using a Gated Recurrent Unit as an oracle, which uses a sequence of characters as its input, (c) it can be used to train models that have a non-differentiable training loss, e.g., Decision Trees, and (d) reasonable defaults exist for most parameters of the algorithm, which makes it convenient to use.  ( 3 min )
    Pruning coupled with learning, ensembles of minimal neural networks, and future of XAI. (arXiv:2005.06284v3 [cs.LG] UPDATED)
    Pruning coupled with learning aims to optimize the neural network (NN) structure for solving specific problems. This optimization can be used for various purposes: to prevent overfitting, to save resources for implementation and training, to provide explainability of the trained NN, and many others. The minimal structure that cannot be pruned further is not unique. Ensemble of minimal structures can be used as a committee of intellectual agents that solves problems by voting. Each minimal NN presents an "empirical knowledge" about the problem and can be verbalized. The non-uniqueness of such knowledge extracted from data is an important property of data-driven Artificial Intelligence (AI). In this work, we review an approach to pruning based on the principle: What controls training should control pruning. This principle is expected to work both for artificial NN and for selection and modification of important synaptic contacts in brain. In back-propagation artificial NN learning is controlled by the gradient of loss functions. Therefore, the first order sensitivity indicators are used for pruning and the algorithms based on these indicators are reviewed. The notion of logically transparent NN was introduced. The approach was illustrated on the problem of political forecasting: predicting the results of the US presidential election. Eight minimal NN were produced that give different forecasting algorithms. The non-uniqueness of solution can be utilised by creation of expert panels (committee). Another use of NN pluralism is to identify areas of input signals where further data collection is most useful. In Conclusion, we discuss the possible future of widely advertised XAI program.  ( 3 min )
    Be More Active! Understanding the Differences between Mean and Sampled Representations of Variational Autoencoders. (arXiv:2109.12679v3 [cs.LG] UPDATED)
    The ability of Variational Autoencoders to learn disentangled representations has made them appealing for practical applications. However, their mean representations, which are generally used for downstream tasks, have recently been shown to be more correlated than their sampled counterpart, on which disentanglement is usually measured. In this paper, we refine this observation through the lens of selective posterior collapse, which states that only a subset of the learned representations, the active variables, is encoding useful information while the rest (the passive variables) is discarded. We first extend the existing definition to multiple data examples and show that active variables are equally disentangled in mean and sampled representations. Based on this extension and the pre-trained models from disentanglement lib, we then isolate the passive variables and show that they are responsible for the discrepancies between mean and sampled representations. Specifically, passive variables exhibit high correlation scores with other variables in mean representations while being fully uncorrelated in sampled ones. We thus conclude that despite what their higher correlation might suggest, mean representations are still good candidates for downstream tasks applications. However, it may be beneficial to remove their passive variables, especially when used with models sensitive to correlated features.  ( 2 min )
    Continuous-time identification of dynamic state-space models by deep subspace encoding. (arXiv:2204.09405v2 [cs.LG] UPDATED)
    Continuous-time (CT) modeling has proven to provide improved sample efficiency and interpretability in learning the dynamical behavior of physical systems compared to discrete-time (DT) models. However, even with numerous recent developments, the CT nonlinear state-space (NL-SS) model identification problem remains to be solved in full, considering common experimental aspects such as the presence of external inputs, measurement noise, latent states, and general robustness. This paper presents a novel estimation method that addresses all these aspects and that can obtain state-of-the-art results on multiple benchmarks with compact fully connected neural networks capturing the CT dynamics. The proposed estimation method called the subspace encoder approach (SUBNET) ascertains these results by efficiently approximating the complete simulation loss by evaluating short simulations on subsections of the data, by using an encoder function to estimate the initial state for each subsection and a novel state-derivative normalization to ensure stability and good numerical conditioning of the training process. We prove that the use of subsections increases cost function smoothness together with the necessary requirements for the existence of the encoder function and we show that the proposed state-derivative normalization is essential for reliable estimation of CT NL-SS models.  ( 2 min )
    Explainable Multilayer Graph Neural Network for Cancer Gene Prediction. (arXiv:2301.08831v1 [cs.LG])
    The identification of cancer genes is a critical, yet challenging problem in cancer genomics research. Recently, several computational methods have been developed to address this issue, including deep neural networks. However, these methods fail to exploit the multilayered gene-gene interactions and provide little to no explanation for their predictions. Results: In this study, we propose an Explainable Multilayer Graph Neural Network (EMGNN) approach to identify cancer genes, by leveraging multiple gene-gene interaction networks and multi-omics data. Compared to conventional graph learning methods, EMGNN learned complementary information in multiple graphs to accurately predict cancer genes. Our method consistently outperforms existing approaches while providing valuable biological insights into its predictions. We further release our novel cancer gene predictions and connect them with known cancer patterns, aiming to accelerate the progress of cancer research  ( 2 min )
    Limitations of Piecewise Linearity for Efficient Robustness Certification. (arXiv:2301.08842v1 [cs.LG])
    Certified defenses against small-norm adversarial examples have received growing attention in recent years; though certified accuracies of state-of-the-art methods remain far below their non-robust counterparts, despite the fact that benchmark datasets have been shown to be well-separated at far larger radii than the literature generally attempts to certify. In this work, we offer insights that identify potential factors in this performance gap. Specifically, our analysis reveals that piecewise linearity imposes fundamental limitations on the tightness of leading certification techniques. These limitations are felt in practical terms as a greater need for capacity in models hoped to be certified efficiently. Moreover, this is in addition to the capacity necessary to learn a robust boundary, studied in prior work. However, we argue that addressing the limitations of piecewise linearity through scaling up model capacity may give rise to potential difficulties -- particularly regarding robust generalization -- therefore, we conclude by suggesting that developing smooth activation functions may be the way forward for advancing the performance of certified neural networks.  ( 2 min )
    Dynamic MLP for MRI Reconstruction. (arXiv:2301.08868v1 [eess.IV])
    As convolutional neural networks (CNN) become the most successful reconstruction technique for accelerated Magnetic Resonance Imaging (MRI), CNN reaches its limit on image quality especially in sharpness. Further improvement on image quality often comes at massive computational costs, hindering their practicability in the clinic setting. MRI reconstruction is essentially a deconvolution problem, which demands long-distance information that is difficult to be captured by CNNs with small convolution kernels. The multi-layer perceptron (MLP) is able to model such long-distance information, but it restricts a fixed input size while the reconstruction of images in flexible resolutions is required in the clinic setting. In this paper, we proposed a hybrid CNN and MLP reconstruction strategy, featured by dynamic MLP (dMLP) that accepts arbitrary image sizes. Experiments were conducted using 3D multi-coil MRI. Our results suggested the proposed dMLP can improve image sharpness compared to its pure CNN counterpart, while costing minor additional GPU memory and computation time. We further compared the proposed dMLP with CNNs using large kernels and studied pure MLP-based reconstruction using a stack of 1D dMLPs, as well as its CNN counterpart using only 1D convolutions. We observed the enlarged receptive field has noticeably improved image quality, while simply using CNN with a large kernel leads to difficulties in training. Noticeably, the pure MLP-based method has been outperformed by CNN-involved methods, which matches the observations in other computer vision tasks for natural images.  ( 2 min )
    Computing equilibria by minimizing exploitability with best-response ensembles. (arXiv:2301.08830v1 [cs.GT])
    In this paper, we study the problem of computing an approximate Nash equilibrium of a continuous game. Such games naturally model many situations involving space, time, money, and other fine-grained resources or quantities. The standard measure of the closeness of a strategy profile to Nash equilibrium is exploitability, which measures how much utility players can gain from changing their strategy unilaterally. We introduce a new equilibrium-finding method that minimizes an approximation of the exploitability. This approximation employs a best-response ensemble for each player that maintains multiple candidate best responses for that player. In each iteration, the best-performing element of each ensemble is used in a gradient-based scheme to update the current strategy profile. The strategy profile and best-response ensembles are simultaneously trained to minimize and maximize the approximate exploitability, respectively. Experiments on a suite of benchmark games show that it outperforms previous methods.  ( 2 min )
    Towards Flexibility and Interpretability of Gaussian Process State-Space Model. (arXiv:2301.08843v1 [cs.LG])
    Gaussian process state-space model (GPSSM) has attracted much attention over the past decade. However, the model representation power of GPSSM is far from satisfactory. Most GPSSM works rely on the standard Gaussian process (GP) with a preliminary kernel, such as squared exponential (SE) kernel and Mat\'{e}rn kernel, which limit the model representation power and its application in complex scenarios. To address this issue, this paper proposes a novel class of probabilistic state-space model named TGPSSM that enriches the GP priors in the standard GPSSM through parametric normalizing flow, making the state-space model more flexible and expressive. In addition, by inheriting the advantages of sparse representation of GP models, we propose a scalable and interpretable variational learning algorithm to learn the TGPSSM and infer the latent dynamics simultaneously. By integrating a constrained optimization framework and explicitly constructing a non-Gaussian state variational distribution, the proposed learning algorithm enables the TGPSSM to significantly improve the capabilities of state space representation and model inference. Experimental results based on various synthetic and real datasets corroborate that the proposed TGPSSM yields superior learning and inference performance compared to several state-of-the-art methods. The accompanying source code is available at https://github.com/zhidilin/TGPSSM.  ( 2 min )
    Split Ways: Privacy-Preserving Training of Encrypted Data Using Split Learning. (arXiv:2301.08778v1 [cs.CR])
    Split Learning (SL) is a new collaborative learning technique that allows participants, e.g. a client and a server, to train machine learning models without the client sharing raw data. In this setting, the client initially applies its part of the machine learning model on the raw data to generate activation maps and then sends them to the server to continue the training process. Previous works in the field demonstrated that reconstructing activation maps could result in privacy leakage of client data. In addition to that, existing mitigation techniques that overcome the privacy leakage of SL prove to be significantly worse in terms of accuracy. In this paper, we improve upon previous works by constructing a protocol based on U-shaped SL that can operate on homomorphically encrypted data. More precisely, in our approach, the client applies Homomorphic Encryption (HE) on the activation maps before sending them to the server, thus protecting user privacy. This is an important improvement that reduces privacy leakage in comparison to other SL-based works. Finally, our results show that, with the optimum set of parameters, training with HE data in the U-shaped SL setting only reduces accuracy by 2.65% compared to training on plaintext. In addition, raw training data privacy is preserved.  ( 2 min )
    ManyDG: Many-domain Generalization for Healthcare Applications. (arXiv:2301.08834v1 [cs.LG])
    The vast amount of health data has been continuously collected for each patient, providing opportunities to support diverse healthcare predictive tasks such as seizure detection and hospitalization prediction. Existing models are mostly trained on other patients data and evaluated on new patients. Many of them might suffer from poor generalizability. One key reason can be overfitting due to the unique information related to patient identities and their data collection environments, referred to as patient covariates in the paper. These patient covariates usually do not contribute to predicting the targets but are often difficult to remove. As a result, they can bias the model training process and impede generalization. In healthcare applications, most existing domain generalization methods assume a small number of domains. In this paper, considering the diversity of patient covariates, we propose a new setting by treating each patient as a separate domain (leading to many domains). We develop a new domain generalization method ManyDG, that can scale to such many-domain problems. Our method identifies the patient domain covariates by mutual reconstruction and removes them via an orthogonal projection step. Extensive experiments show that ManyDG can boost the generalization performance on multiple real-world healthcare tasks (e.g., 3.7% Jaccard improvements on MIMIC drug recommendation) and support realistic but challenging settings such as insufficient data and continuous learning.  ( 2 min )
    AQuaMaM: An Autoregressive, Quaternion Manifold Model for Rapidly Estimating Complex SO(3) Distributions. (arXiv:2301.08838v1 [cs.LG])
    Accurately modeling complex, multimodal distributions is necessary for optimal decision-making, but doing so for rotations in three-dimensions, i.e., the SO(3) group, is challenging due to the curvature of the rotation manifold. The recently described implicit-PDF (IPDF) is a simple, elegant, and effective approach for learning arbitrary distributions on SO(3) up to a given precision. However, inference with IPDF requires $N$ forward passes through the network's final multilayer perceptron (where $N$ places an upper bound on the likelihood that can be calculated by the model), which is prohibitively slow for those without the computational resources necessary to parallelize the queries. In this paper, I introduce AQuaMaM, a neural network capable of both learning complex distributions on the rotation manifold and calculating exact likelihoods for query rotations in a single forward pass. Specifically, AQuaMaM autoregressively models the projected components of unit quaternions as mixtures of uniform distributions that partition their geometrically-restricted domain of values. When trained on an "infinite" toy dataset with ambiguous viewpoints, AQuaMaM rapidly converges to a sampling distribution closely matching the true data distribution. In contrast, the sampling distribution for IPDF dramatically diverges from the true data distribution, despite IPDF approaching its theoretical minimum evaluation loss during training. When trained on a constructed dataset of 500,000 renders of a die in different rotations, AQuaMaM reaches a test log-likelihood 14% higher than IPDF. Further, compared to IPDF, AQuaMaM uses 24% fewer parameters, has a prediction throughput 52$\times$ faster on a single GPU, and converges in a similar amount of time during training.  ( 2 min )
    Compact Optimization Learning for AC Optimal Power Flow. (arXiv:2301.08840v1 [cs.LG])
    This paper reconsiders end-to-end learning approaches to the Optimal Power Flow (OPF). Existing methods, which learn the input/output mapping of the OPF, suffer from scalability issues due to the high dimensionality of the output space. This paper first shows that the space of optimal solutions can be significantly compressed using principal component analysis (PCA). It then proposes Compact Learning, a new method that learns in a subspace of the principal components before translating the vectors into the original output space. This compression reduces the number of trainable parameters substantially, improving scalability and effectiveness. Compact Learning is evaluated on a variety of test cases from the PGLib with up to 30,000 buses. The paper also shows that the output of Compact Learning can be used to warm-start an exact AC solver to restore feasibility, while bringing significant speed-ups.  ( 2 min )
    Optimized learned entropy coding parameters for practical neural-based image and video compression. (arXiv:2301.08752v1 [eess.IV])
    Neural-based image and video codecs are significantly more power-efficient when weights and activations are quantized to low-precision integers. While there are general-purpose techniques for reducing quantization effects, large losses can occur when specific entropy coding properties are not considered. This work analyzes how entropy coding is affected by parameter quantizations, and provides a method to minimize losses. It is shown that, by using a certain type of coding parameters to be learned, uniform quantization becomes practically optimal, also simplifying the minimization of code memory requirements. The mathematical properties of the new representation are presented, and its effectiveness is demonstrated by coding experiments, showing that good results can be obtained with precision as low as 4~bits per network output, and practically no loss with 8~bits.  ( 2 min )
    GBOSE: Generalized Bandit Orthogonalized Semiparametric Estimation. (arXiv:2301.08781v1 [cs.LG])
    In sequential decision-making scenarios i.e., mobile health recommendation systems revenue management contextual multi-armed bandit algorithms have garnered attention for their performance. But most of the existing algorithms are built on the assumption of a strictly parametric reward model mostly linear in nature. In this work we propose a new algorithm with a semi-parametric reward model with state-of-the-art complexity of upper bound on regret amongst existing semi-parametric algorithms. Our work expands the scope of another representative algorithm of state-of-the-art complexity with a similar reward model by proposing an algorithm built upon the same action filtering procedures but provides explicit action selection distribution for scenarios involving more than two arms at a particular time step while requiring fewer computations. We derive the said complexity of the upper bound on regret and present simulation results that affirm our methods superiority out of all prevalent semi-parametric bandit algorithms for cases involving over two arms.  ( 2 min )
    Active Learning of Piecewise Gaussian Process Surrogates. (arXiv:2301.08789v1 [cs.LG])
    Active learning of Gaussian process (GP) surrogates has been useful for optimizing experimental designs for physical/computer simulation experiments, and for steering data acquisition schemes in machine learning. In this paper, we develop a method for active learning of piecewise, Jump GP surrogates. Jump GPs are continuous within, but discontinuous across, regions of a design space, as required for applications spanning autonomous materials design, configuration of smart factory systems, and many others. Although our active learning heuristics are appropriated from strategies originally designed for ordinary GPs, we demonstrate that additionally accounting for model bias, as opposed to the usual model uncertainty, is essential in the Jump GP context. Toward that end, we develop an estimator for bias and variance of Jump GP models. Illustrations, and evidence of the advantage of our proposed methods, are provided on a suite of synthetic benchmarks, and real-simulation experiments of varying complexity.  ( 2 min )
    Estimation of mitral valve hinge point coordinates -- deep neural net for echocardiogram segmentation. (arXiv:2301.08782v1 [eess.IV])
    Cardiac image segmentation is a powerful tool in regard to diagnostics and treatment of cardiovascular diseases. Purely feature-based detection of anatomical structures like the mitral valve is a laborious task due to specifically required feature engineering and is especially challenging in echocardiograms, because of their inherently low contrast and blurry boundaries between some anatomical structures. With the publication of further annotated medical datasets and the increase in GPU processing power, deep learning-based methods in medical image segmentation became more feasible in the past years. We propose a fully automatic detection method for mitral valve hinge points, which uses a U-Net based deep neural net to segment cardiac chambers in echocardiograms in a first step, and subsequently extracts the mitral valve hinge points from the resulting segmentations in a second step. Results measured with this automatic detection method were compared to reference coordinate values, which with median absolute hinge point coordinate errors of 1.35 mm for the x- (15-85 percentile range: [0.3 mm; 3.15 mm]) and 0.75 mm for the y- coordinate (15-85 percentile range: [0.15 mm; 1.88 mm]).  ( 2 min )
    Domain-agnostic and Multi-level Evaluation of Generative Models. (arXiv:2301.08750v1 [cs.LG])
    While the capabilities of generative models heavily improved in different domains (images, text, graphs, molecules, etc.), their evaluation metrics largely remain based on simplified quantities or manual inspection with limited practicality. To this end, we propose a framework for Multi-level Performance Evaluation of Generative mOdels (MPEGO), which could be employed across different domains. MPEGO aims to quantify generation performance hierarchically, starting from a sub-feature-based low-level evaluation to a global features-based high-level evaluation. MPEGO offers great customizability as the employed features are entirely user-driven and can thus be highly domain/problem-specific while being arbitrarily complex (e.g., outcomes of experimental procedures). We validate MPEGO using multiple generative models across several datasets from the material discovery domain. An ablation study is conducted to study the plausibility of intermediate steps in MPEGO. Results demonstrate that MPEGO provides a flexible, user-driven, and multi-level evaluation framework, with practical insights on the generation quality. The framework, source code, and experiments will be available at https://github.com/GT4SD/mpego.  ( 2 min )
    Towards Understanding How Self-training Tolerates Data Backdoor Poisoning. (arXiv:2301.08751v1 [cs.LG])
    Recent studies on backdoor attacks in model training have shown that polluting a small portion of training data is sufficient to produce incorrect manipulated predictions on poisoned test-time data while maintaining high clean accuracy in downstream tasks. The stealthiness of backdoor attacks has imposed tremendous defense challenges in today's machine learning paradigm. In this paper, we explore the potential of self-training via additional unlabeled data for mitigating backdoor attacks. We begin by making a pilot study to show that vanilla self-training is not effective in backdoor mitigation. Spurred by that, we propose to defend the backdoor attacks by leveraging strong but proper data augmentations in the self-training pseudo-labeling stage. We find that the new self-training regime help in defending against backdoor attacks to a great extent. Its effectiveness is demonstrated through experiments for different backdoor triggers on CIFAR-10 and a combination of CIFAR-10 with an additional unlabeled 500K TinyImages dataset. Finally, we explore the direction of combining self-supervised representation learning with self-training for further improvement in backdoor defense.  ( 2 min )
    CSwin2SR: Circular Swin2SR for Compressed Image Super-Resolution. (arXiv:2301.08749v1 [eess.IV])
    Closed-loop negative feedback mechanism is extensively utilized in automatic control systems and brings about extraordinary dynamic and static performance. In order to further improve the reconstruction capability of current methods of compressed image super-resolution, a circular Swin2SR (CSwin2SR) approach is proposed. The CSwin2SR contains a serial Swin2SR for initial super-resolution reestablishment and circular Swin2SR for enhanced super-resolution reestablishment. Simulated experimental results show that the proposed CSwin2SR dramatically outperforms the classical Swin2SR in the capacity of super-resolution recovery. On DIV2K test and valid datasets, the average increment of PSNR is greater than 1dB and the related average increment of SSIM is greater than 0.006.  ( 2 min )
    Towards a Measure of Trustworthiness to Evaluate CNNs During Operation. (arXiv:2301.08839v1 [cs.LG])
    Due to black box nature of Convolutional neural networks (CNNs), the continuous validation of CNN classifiers' during operation is infeasible. As a result this makes it difficult for developers or regulators to gain confidence in the deployment of autonomous systems employing CNNs. We introduce the trustworthiness in classification score (TCS), a metric to assist with overcoming this challenge. The metric quantifies the trustworthiness in a prediction by checking for the existence of certain features in the predictions made by the CNN. A case study on persons detection is used to to demonstrate our method and the usage of TCS.  ( 2 min )
    Causal Inference under Data Restrictions. (arXiv:2301.08788v1 [stat.ME])
    This dissertation focuses on modern causal inference under uncertainty and data restrictions, with applications to neoadjuvant clinical trials, distributed data networks, and robust individualized decision making. In the first project, we propose a method under the principal stratification framework to identify and estimate the average treatment effects on a binary outcome, conditional on the counterfactual status of a post-treatment intermediate response. Under mild assumptions, the treatment effect of interest can be identified. We extend the approach to address censored outcome data. The proposed method is applied to a neoadjuvant clinical trial and its performance is evaluated via simulation studies. In the second project, we propose a tree-based model averaging approach to improve the estimation accuracy of conditional average treatment effects at a target site by leveraging models derived from other potentially heterogeneous sites, without them sharing subject-level data. The performance of this approach is demonstrated by a study of the causal effects of oxygen therapy on hospital survival rates and backed up by comprehensive simulations. In the third project, we propose a robust individualized decision learning framework with sensitive variables to improve the worst-case outcomes of individuals caused by sensitive variables that are unavailable at the time of decision. Unlike most existing work that uses mean-optimal objectives, we propose a robust learning framework by finding a newly defined quantile- or infimum-optimal decision rule. From a causal perspective, we also generalize the classic notion of (average) fairness to conditional fairness for individual subjects. The reliable performance of the proposed method is demonstrated through synthetic experiments and three real-data applications.  ( 2 min )
    An Automated Vulnerability Detection Framework for Smart Contracts. (arXiv:2301.08824v1 [cs.CR])
    With the increase of the adoption of blockchain technology in providing decentralized solutions to various problems, smart contracts have become more popular to the point that billions of US Dollars are currently exchanged every day through such technology. Meanwhile, various vulnerabilities in smart contracts have been exploited by attackers to steal cryptocurrencies worth millions of dollars. The automatic detection of smart contract vulnerabilities therefore is an essential research problem. Existing solutions to this problem particularly rely on human experts to define features or different rules to detect vulnerabilities. However, this often causes many vulnerabilities to be ignored, and they are inefficient in detecting new vulnerabilities. In this study, to overcome such challenges, we propose a framework to automatically detect vulnerabilities in smart contracts on the blockchain. More specifically, first, we utilize novel feature vector generation techniques from bytecode of smart contract since the source code of smart contracts are rarely available in public. Next, the collected vectors are fed into our novel metric learning-based deep neural network(DNN) to get the detection result. We conduct comprehensive experiments on large-scale benchmarks, and the quantitative results demonstrate the effectiveness and efficiency of our approach.  ( 2 min )
  • Open

    Probabilistic Surrogate Networks for Simulators with Unbounded Randomness. (arXiv:1910.11950v3 [cs.LG] UPDATED)
    We present a framework for automatically structuring and training fast, approximate, deep neural surrogates of stochastic simulators. Unlike traditional approaches to surrogate modeling, our surrogates retain the interpretable structure and control flow of the reference simulator. Our surrogates target stochastic simulators where the number of random variables itself can be stochastic and potentially unbounded. Our framework further enables an automatic replacement of the reference simulator with the surrogate when undertaking amortized inference. The fidelity and speed of our surrogates allow for both faster stochastic simulation and accurate and substantially faster posterior inference. Using an illustrative yet non-trivial example we show our surrogates' ability to accurately model a probabilistic program with an unbounded number of random variables. We then proceed with an example that shows our surrogates are able to accurately model a complex structure like an unbounded stack in a program synthesis example. We further demonstrate how our surrogate modeling technique makes amortized inference in complex black-box simulators an order of magnitude faster. Specifically, we do simulator-based materials quality testing, inferring safety-critical latent internal temperature profiles of composite materials undergoing curing.  ( 2 min )
    Tailoring to the Tails: Risk Measures for Fine-Grained Tail Sensitivity. (arXiv:2208.03066v2 [cs.LG] UPDATED)
    Expected risk minimization (ERM) is at the core of many machine learning systems. This means that the risk inherent in a loss distribution is summarized using a single number - its average. In this paper, we propose a general approach to construct risk measures which exhibit a desired tail sensitivity and may replace the expectation operator in ERM. Our method relies on the specification of a reference distribution with a desired tail behaviour, which is in a one-to-one correspondence to a coherent upper probability. Any risk measure, which is compatible with this upper probability, displays a tail sensitivity which is finely tuned to the reference distribution. As a concrete example, we focus on divergence risk measures based on f-divergence ambiguity sets, which are a widespread tool used to foster distributional robustness of machine learning systems. For instance, we show how ambiguity sets based on the Kullback-Leibler divergence are intricately tied to the class of subexponential random variables. We elaborate the connection of divergence risk measures and rearrangement invariant Banach norms.  ( 2 min )
    Estimating individual treatment effects under unobserved confounding using binary instruments. (arXiv:2208.08544v3 [stat.ME] UPDATED)
    Estimating conditional average treatment effects (CATEs) from observational data is relevant in many fields such as personalized medicine. However, in practice, the treatment assignment is usually confounded by unobserved variables and thus introduces bias. A remedy to remove the bias is the use of instrumental variables (IVs). Such settings are widespread in medicine (e.g., trials where the treatment assignment is used as binary IV). In this paper, we propose a novel, multiply robust machine learning framework, called MRIV, for estimating CATEs using binary IVs and thus yield an unbiased CATE estimator. Different from previous work for binary IVs, our framework estimates the CATE directly via a pseudo outcome regression. (1)~We provide a theoretical analysis where we show that our framework yields multiple robust convergence rates: our CATE estimator achieves fast convergence even if several nuisance estimators converge slowly. (2)~We further show that our framework asymptotically outperforms state-of-the-art plug-in IV methods for CATE estimation, in the sense that it achieves a faster rate of convergence if the CATE is smoother than the individual outcome surfaces. (3)~We build upon our theoretical results and propose a tailored deep neural network architecture called MRIV-Net for CATE estimation using binary IVs. Across various computational experiments, we demonstrate empirically that our MRIV-Net achieves state-of-the-art performance. To the best of our knowledge, our MRIV is the first multiply robust machine learning framework tailored to estimating CATEs in the binary IV setting.  ( 2 min )
    A Multi-Phase Approach for Product Hierarchy Forecasting in Supply Chain Management: Application to MonarchFx Inc. (arXiv:2006.08931v2 [stat.ML] UPDATED)
    Hierarchical time series demands exist in many industries and are often associated with the product, time frame, or geographic aggregations. Traditionally, these hierarchies have been forecasted using top-down, bottom-up, or middle-out approaches. The question we aim to answer is how to utilize child-level forecasts to improve parent-level forecasts in a hierarchical supply chain. Improved forecasts can be used to considerably reduce logistics costs, especially in e-commerce. We propose a novel multi-phase hierarchical (MPH) approach. Our method involves forecasting each series in the hierarchy independently using machine learning models, then combining all forecasts to allow a second phase model estimation at the parent level. Sales data from MonarchFx Inc. (a logistics solutions provider) is used to evaluate our approach and compare it to bottom-up and top-down methods. Our results demonstrate an 82-90% improvement in forecast accuracy using the proposed approach. Using the proposed method, supply chain planners can derive more accurate forecasting models to exploit the benefit of multivariate data.  ( 2 min )
    Autoencoding Hyperbolic Representation for Adversarial Generation. (arXiv:2201.12825v3 [cs.LG] UPDATED)
    With the recent advance of geometric deep learning, neural networks have been extensively used for data in non-Euclidean domains. In particular, hyperbolic neural networks have proved successful in processing hierarchical information of data. However, many hyperbolic neural networks are numerically unstable during training, which precludes using complex architectures. This crucial problem makes it difficult to build hyperbolic generative models for real and complex data. In this work, we propose a hyperbolic generative network in which we design novel architecture and layers to improve stability in training. Our proposed network contains three parts: first, a hyperbolic autoencoder (AE) that produces hyperbolic embedding for input data; second, a hyperbolic generative adversarial network (GAN) for generating the hyperbolic latent embedding of the AE from simple noise; third, a generator that inherits the decoder from the AE and the generator from the GAN. We call this network the hyperbolic AE-GAN, or HAEGAN for short. The architecture of HAEGAN fosters expressive representation in the hyperbolic space, and the specific design of layers ensures numerical stability. Experiments show that HAEGAN is able to generate complex data with state-of-the-art structure-related performance.  ( 2 min )
    Online Kernel Sliced Inverse Regression. (arXiv:2301.09516v1 [stat.CO])
    Online dimension reduction is a common method for high-dimensional streaming data processing. Online principal component analysis, online sliced inverse regression, online kernel principal component analysis and other methods have been studied in depth, but as far as we know, online supervised nonlinear dimension reduction methods have not been fully studied. In this article, an online kernel sliced inverse regression method is proposed. By introducing the approximate linear dependence condition and dictionary variable sets, we address the problem of increasing variable dimensions with the sample size in the online kernel sliced inverse regression method, and propose a reduced-order method for updating variables online. We then transform the problem into an online generalized eigen-decomposition problem, and use the stochastic optimization method to update the centered dimension reduction directions. Simulations and the real data analysis show that our method can achieve close performance to batch processing kernel sliced inverse regression.  ( 2 min )
    Convergence bounds for local least squares approximation. (arXiv:2208.10954v2 [math.NA] UPDATED)
    We consider the problem of approximating a function in a general nonlinear subset of $L^2$, when only a weighted Monte Carlo estimate of the $L^2$-norm can be computed. Of particular interest in this setting is the concept of sample complexity, the number of sample points that are necessary to achieve a prescribed error with high probability. Reasonable worst-case bounds for this quantity exist only for particular model classes, like linear spaces or sets of sparse vectors. For more general sets, like tensor networks or neural networks, the currently existing bounds are very pessimistic. By restricting the model class to a neighbourhood of the best approximation, we can derive improved worst-case bounds for the sample complexity. When the considered neighbourhood is a manifold with positive local reach, its sample complexity can be estimated by means of the sample complexities of the tangent and normal spaces and the manifold's curvature.  ( 2 min )
    Characterizing Polarization in Social Networks using the Signed Relational Latent Distance Model. (arXiv:2301.09507v1 [stat.ML])
    Graph representation learning has become a prominent tool for the characterization and understanding of the structure of networks in general and social networks in particular. Typically, these representation learning approaches embed the networks into a low-dimensional space in which the role of each individual can be characterized in terms of their latent position. A major current concern in social networks is the emergence of polarization and filter bubbles promoting a mindset of "us-versus-them" that may be defined by extreme positions believed to ultimately lead to political violence and the erosion of democracy. Such polarized networks are typically characterized in terms of signed links reflecting likes and dislikes. We propose the latent Signed relational Latent dIstance Model (SLIM) utilizing for the first time the Skellam distribution as a likelihood function for signed networks and extend the modeling to the characterization of distinct extreme positions by constraining the embedding space to polytopes. On four real social signed networks of polarization, we demonstrate that the model extracts low-dimensional characterizations that well predict friendships and animosity while providing interpretable visualizations defined by extreme positions when endowing the model with an embedding space restricted to polytopes.  ( 2 min )
    Max-Quantile Grouped Infinite-Arm Bandits. (arXiv:2210.01295v2 [stat.ML] UPDATED)
    In this paper, we consider a bandit problem in which there are a number of groups each consisting of infinitely many arms. Whenever a new arm is requested from a given group, its mean reward is drawn from an unknown reservoir distribution (different for each group), and the uncertainty in the arm's mean reward can only be reduced via subsequent pulls of the arm. The goal is to identify the infinite-arm group whose reservoir distribution has the highest $(1-\alpha)$-quantile (e.g., median if $\alpha = \frac{1}{2}$), using as few total arm pulls as possible. We introduce a two-step algorithm that first requests a fixed number of arms from each group and then runs a finite-arm grouped max-quantile bandit algorithm. We characterize both the instance-dependent and worst-case regret, and provide a matching lower bound for the latter, while discussing various strengths, weaknesses, algorithmic improvements, and potential lower bounds associated with our instance-dependent upper bounds.  ( 2 min )
    On the Convergence of the Gradient Descent Method with Stochastic Fixed-point Rounding Errors under the Polyak-Lojasiewicz Inequality. (arXiv:2301.09511v1 [stat.ML])
    When training neural networks with low-precision computation, rounding errors often cause stagnation or are detrimental to the convergence of the optimizers; in this paper we study the influence of rounding errors on the convergence of the gradient descent method for problems satisfying the Polyak-Lojasiewicz inequality. Within this context, we show that, in contrast, biased stochastic rounding errors may be beneficial since choosing a proper rounding strategy eliminates the vanishing gradient problem and forces the rounding bias in a descent direction. Furthermore, we obtain a bound on the convergence rate that is stricter than the one achieved by unbiased stochastic rounding. The theoretical analysis is validated by comparing the performances of various rounding strategies when optimizing several examples using low-precision fixed-point number formats.  ( 2 min )
    Huber-Robust Confidence Sequences. (arXiv:2301.09573v1 [math.ST])
    Confidence sequences are confidence intervals that can be sequentially tracked, and are valid at arbitrary data-dependent stopping times. This paper presents confidence sequences for a univariate mean of an unknown distribution with a known upper bound on the p-th central moment (p > 1), but allowing for (at most) {\epsilon} fraction of arbitrary distribution corruption, as in Huber's contamination model. We do this by designing new robust exponential supermartingales, and show that the resulting confidence sequences attain the optimal width achieved in the nonsequential setting. Perhaps surprisingly, the constant margin between our sequential result and the lower bound is smaller than even fixed-time robust confidence intervals based on the trimmed mean, for example. Since confidence sequences are a common tool used within A/B/n testing and bandits, these results open the door to sequential experimentation that is robust to outliers and adversarial corruptions.  ( 2 min )
    Indirect Active Learning. (arXiv:2206.01454v3 [math.ST] UPDATED)
    Traditional models of active learning assume a learner can directly manipulate or query a covariate $X$ in order to study its relationship with a response $Y$. However, if $X$ is a feature of a complex system, it may be possible only to indirectly influence $X$ by manipulating a control variable $Z$, a scenario we refer to as Indirect Active Learning. Under a nonparametric model of Indirect Active Learning with a fixed budget, we study minimax convergence rates for estimating the relationship between $X$ and $Y$ locally at a point, obtaining different rates depending on the complexities and noise levels of the relationships between $Z$ and $X$ and between $X$ and $Y$. We also identify minimax rates for passive learning under comparable assumptions. In many cases, our results show that, while there is an asymptotic benefit to active learning, this benefit is fully realized by a simple two-stage learner that runs two passive experiments in sequence.  ( 2 min )
    Prediction Errors for Penalized Regressions based on Generalized Approximate Message Passing. (arXiv:2206.12832v3 [stat.ML] UPDATED)
    We discuss the prediction accuracy of assumed statistical models in terms of prediction errors for the generalized linear model and penalized maximum likelihood methods. We derive the forms of estimators for the prediction errors, such as $C_p$ criterion, information criteria, and leave-one-out cross validation (LOOCV) error, using the generalized approximate message passing (GAMP) algorithm and replica method. These estimators coincide with each other when the number of model parameters is sufficiently small; however, there is a discrepancy between them in particular in the parameter region where the number of model parameters is larger than the data dimension. In this paper, we review the prediction errors and corresponding estimators, and discuss their differences. In the framework of GAMP, we show that the information criteria can be expressed by using the variance of the estimates. Further, we demonstrate how to approach LOOCV error from the information criteria by utilizing the expression provided by GAMP.  ( 2 min )
    Dealing with Unknown Variances in Best-Arm Identification. (arXiv:2210.00974v2 [stat.ML] UPDATED)
    The problem of identifying the best arm among a collection of items having Gaussian rewards distribution is well understood when the variances are known. Despite its practical relevance for many applications, few works studied it for unknown variances. In this paper we introduce and analyze two approaches to deal with unknown variances, either by plugging in the empirical variance or by adapting the transportation costs. In order to calibrate our two stopping rules, we derive new time-uniform concentration inequalities, which are of independent interest. Then, we illustrate the theoretical and empirical performances of our two sampling rule wrappers on Track-and-Stop and on a Top Two algorithm. Moreover, by quantifying the impact on the sample complexity of not knowing the variances, we reveal that it is rather small.  ( 2 min )
    Evaluating Synthetically Generated Data from Small Sample Sizes: An Experimental Study. (arXiv:2211.10760v3 [cs.LG] UPDATED)
    In this paper, we propose a method for measuring the similarity low sample tabular data with synthetically generated data with a larger number of samples than original. This process is also known as data augmentation. But significance levels obtained from non-parametric tests are suspect when sample size is small. Our method uses a combination of geometry, topology and robust statistics for hypothesis testing in order to compare the validity of generated data. We also compare the results with common global metric methods available in the literature for large sample size data.  ( 2 min )
    Explicit Regularization in Overparametrized Models via Noise Injection. (arXiv:2206.04613v3 [cs.LG] UPDATED)
    Injecting noise within gradient descent has several desirable features, such as smoothing and regularizing properties. In this paper, we investigate the effects of injecting noise before computing a gradient step. We demonstrate that small perturbations can induce explicit regularization for simple models based on the L1-norm, group L1-norms, or nuclear norms. However, when applied to overparametrized neural networks with large widths, we show that the same perturbations can cause variance explosion. To overcome this, we propose using independent layer-wise perturbations, which provably allow for explicit regularization without variance explosion. Our empirical results show that these small perturbations lead to improved generalization performance compared to vanilla gradient descent.  ( 2 min )
    Rethinking the Expressive Power of GNNs via Graph Biconnectivity. (arXiv:2301.09505v1 [cs.LG])
    Designing expressive Graph Neural Networks (GNNs) is a central topic in learning graph-structured data. While numerous approaches have been proposed to improve GNNs in terms of the Weisfeiler-Lehman (WL) test, generally there is still a lack of deep understanding of what additional power they can systematically and provably gain. In this paper, we take a fundamentally different perspective to study the expressive power of GNNs beyond the WL test. Specifically, we introduce a novel class of expressivity metrics via graph biconnectivity and highlight their importance in both theory and practice. As biconnectivity can be easily calculated using simple algorithms that have linear computational costs, it is natural to expect that popular GNNs can learn it easily as well. However, after a thorough review of prior GNN architectures, we surprisingly find that most of them are not expressive for any of these metrics. The only exception is the ESAN framework (Bevilacqua et al., 2022), for which we give a theoretical justification of its power. We proceed to introduce a principled and more efficient approach, called the Generalized Distance Weisfeiler-Lehman (GD-WL), which is provably expressive for all biconnectivity metrics. Practically, we show GD-WL can be implemented by a Transformer-like architecture that preserves expressiveness and enjoys full parallelizability. A set of experiments on both synthetic and real datasets demonstrates that our approach can consistently outperform prior GNN architectures.  ( 2 min )
    Sampling-based Nystr\"om Approximation and Kernel Quadrature. (arXiv:2301.09517v1 [math.NA])
    We analyze the Nystr\"om approximation of a positive definite kernel associated with a probability measure. We first prove an improved error bound for the conventional Nystr\"om approximation with i.i.d. sampling and singular-value decomposition in the continuous regime; the proof techniques are borrowed from statistical learning theory. We further introduce a refined selection of subspaces in Nystr\"om approximation with theoretical guarantees that is applicable to non-i.i.d. landmark points. Finally, we discuss their application to convex kernel quadrature and give novel theoretical guarantees as well as numerical observations.  ( 2 min )
    Learning Interpretable Models Using an Oracle. (arXiv:1906.06852v5 [cs.LG] UPDATED)
    We look at a specific aspect of model interpretability: models often need to be constrained in size for them to be considered interpretable. But smaller models also tend to have high bias. This suggests a trade-off between interpretability and accuracy. Our work addresses this by: (a) showing that learning a training distribution (often different from the test distribution) can often increase accuracy of small models, and therefore may be used as a strategy to compensate for small sizes, and (b) providing a model-agnostic algorithm to learn such training distributions. We pose the distribution learning problem as one of optimizing parameters for an Infinite Beta Mixture Model based on a Dirichlet Process, so that the held-out accuracy of a model trained on a sample from this distribution is maximized. To make computation tractable, we project the training data onto one dimension: prediction uncertainty scores as provided by a highly accurate oracle model. A Bayesian Optimizer is used for learning the parameters. Empirical results using multiple real world datasets, various oracles and interpretable models with different notions of model sizes, are presented. We observe significant relative improvements in the F1-score in most cases, occasionally seeing improvements greater than 100% over baselines. Additionally we show that the proposed algorithm provides the following benefits: (a) its a framework which allows for flexibility in implementation, (b) it can be used across feature spaces, e.g., the text classification accuracy of a Decision Tree using character n-grams is shown to improve when using a Gated Recurrent Unit as an oracle, which uses a sequence of characters as its input, (c) it can be used to train models that have a non-differentiable training loss, e.g., Decision Trees, and (d) reasonable defaults exist for most parameters of the algorithm, which makes it convenient to use.  ( 3 min )
    Critic Sequential Monte Carlo. (arXiv:2205.15460v2 [stat.ML] UPDATED)
    We introduce CriticSMC, a new algorithm for planning as inference built from a composition of sequential Monte Carlo with learned Soft-Q function heuristic factors. These heuristic factors, obtained from parametric approximations of the marginal likelihood ahead, more effectively guide SMC towards the desired target distribution, which is particularly helpful for planning in environments with hard constraints placed sparsely in time. Compared with previous work, we modify the placement of such heuristic factors, which allows us to cheaply propose and evaluate large numbers of putative action particles, greatly increasing inference and planning efficiency. CriticSMC is compatible with informative priors, whose density function need not be known, and can be used as a model-free control algorithm. Our experiments on collision avoidance in a high-dimensional simulated driving task show that CriticSMC significantly reduces collision rates at a low computational cost while maintaining realism and diversity of driving behaviors across vehicles and environment scenarios.  ( 2 min )
    SpArX: Sparse Argumentative Explanations for Neural Networks. (arXiv:2301.09559v1 [cs.AI])
    Neural networks (NNs) have various applications in AI, but explaining their decision process remains challenging. Existing approaches often focus on explaining how changing individual inputs affects NNs' outputs. However, an explanation that is consistent with the input-output behaviour of an NN is not necessarily faithful to the actual mechanics thereof. In this paper, we exploit relationships between multi-layer perceptrons (MLPs) and quantitative argumentation frameworks (QAFs) to create argumentative explanations for the mechanics of MLPs. Our SpArX method first sparsifies the MLP while maintaining as much of the original mechanics as possible. It then translates the sparse MLP into an equivalent QAF to shed light on the underlying decision process of the MLP, producing global and/or local explanations. We demonstrate experimentally that SpArX can give more faithful explanations than existing approaches, while simultaneously providing deeper insights into the actual reasoning process of MLPs.  ( 2 min )
    Particle algorithms for maximum likelihood training of latent variable models. (arXiv:2204.12965v4 [stat.CO] UPDATED)
    (Neal and Hinton, 1998) recast maximum likelihood estimation of any given latent variable model as the minimization of a free energy functional $F$, and the EM algorithm as coordinate descent applied to $F$. Here, we explore alternative ways to optimize the functional. In particular, we identify various gradient flows associated with $F$ and show that their limits coincide with $F$'s stationary points. By discretizing the flows, we obtain practical particle-based algorithms for maximum likelihood estimation in broad classes of latent variable models. The novel algorithms scale to high-dimensional settings and perform well in numerical experiments.  ( 2 min )
    Stability of Image-Reconstruction Algorithms. (arXiv:2206.07128v3 [math.OC] UPDATED)
    Robustness and stability of image-reconstruction algorithms have recently come under scrutiny. Their importance to medical imaging cannot be overstated. We review the known results for the topical variational regularization strategies ($\ell_2$ and $\ell_1$ regularization) and present novel stability results for $\ell_p$-regularized linear inverse problems for $p\in(1,\infty)$. Our results guarantee Lipschitz continuity for small $p$ and H\"{o}lder continuity for larger $p$. They generalize well to the $L_p(\Omega)$ function spaces.  ( 2 min )
    Pruning coupled with learning, ensembles of minimal neural networks, and future of XAI. (arXiv:2005.06284v3 [cs.LG] UPDATED)
    Pruning coupled with learning aims to optimize the neural network (NN) structure for solving specific problems. This optimization can be used for various purposes: to prevent overfitting, to save resources for implementation and training, to provide explainability of the trained NN, and many others. The minimal structure that cannot be pruned further is not unique. Ensemble of minimal structures can be used as a committee of intellectual agents that solves problems by voting. Each minimal NN presents an "empirical knowledge" about the problem and can be verbalized. The non-uniqueness of such knowledge extracted from data is an important property of data-driven Artificial Intelligence (AI). In this work, we review an approach to pruning based on the principle: What controls training should control pruning. This principle is expected to work both for artificial NN and for selection and modification of important synaptic contacts in brain. In back-propagation artificial NN learning is controlled by the gradient of loss functions. Therefore, the first order sensitivity indicators are used for pruning and the algorithms based on these indicators are reviewed. The notion of logically transparent NN was introduced. The approach was illustrated on the problem of political forecasting: predicting the results of the US presidential election. Eight minimal NN were produced that give different forecasting algorithms. The non-uniqueness of solution can be utilised by creation of expert panels (committee). Another use of NN pluralism is to identify areas of input signals where further data collection is most useful. In Conclusion, we discuss the possible future of widely advertised XAI program.  ( 3 min )
    Estimating average causal effects from patient trajectories. (arXiv:2203.01228v2 [stat.ML] UPDATED)
    In medical practice, treatments are selected based on the expected causal effects on patient outcomes. Here, the gold standard for estimating causal effects are randomized controlled trials; however, such trials are costly and sometimes even unethical. Instead, medical practice is increasingly interested in estimating causal effects among patient (sub)groups from electronic health records, that is, observational data. In this paper, we aim at estimating the average causal effect (ACE) from observational data (patient trajectories) that are collected over time. For this, we propose DeepACE: an end-to-end deep learning model. DeepACE leverages the iterative G-computation formula to adjust for the bias induced by time-varying confounders. Moreover, we develop a novel sequential targeting procedure which ensures that DeepACE has favorable theoretical properties, i.e., is doubly robust and asymptotically efficient. To the best of our knowledge, this is the first work that proposes an end-to-end deep learning model tailored for estimating time-varying ACEs. We compare DeepACE in an extensive number of experiments, confirming that it achieves state-of-the-art performance. We further provide a case study for patients suffering from low back pain to demonstrate that DeepACE generates important and meaningful findings for clinical practice. Our work enables practitioners to develop effective treatment recommendations based on population effects.  ( 2 min )
    Discriminative Multimodal Learning via Conditional Priors in Generative Models. (arXiv:2110.04616v3 [cs.LG] UPDATED)
    Deep generative models with latent variables have been used lately to learn joint representations and generative processes from multi-modal data. These two learning mechanisms can, however, conflict with each other and representations can fail to embed information on the data modalities. This research studies the realistic scenario in which all modalities and class labels are available for model training, but where some modalities and labels required for downstream tasks are missing. We show, in this scenario, that the variational lower bound limits mutual information between joint representations and missing modalities. We, to counteract these problems, introduce a novel conditional multi-modal discriminative model that uses an informative prior distribution and optimizes a likelihood-free objective function that maximizes mutual information between joint representations and missing modalities. Extensive experimentation demonstrates the benefits of our proposed model, empirical results show that our model achieves state-of-the-art results in representative problems such as downstream classification, acoustic inversion, and image and annotation generation.  ( 2 min )
    Explainable Quantum Machine Learning. (arXiv:2301.09138v1 [quant-ph])
    Methods of artificial intelligence (AI) and especially machine learning (ML) have been growing ever more complex, and at the same time have more and more impact on people's lives. This leads to explainable AI (XAI) manifesting itself as an important research field that helps humans to better comprehend ML systems. In parallel, quantum machine learning (QML) is emerging with the ongoing improvement of quantum computing hardware combined with its increasing availability via cloud services. QML enables quantum-enhanced ML in which quantum mechanics is exploited to facilitate ML tasks, typically in form of quantum-classical hybrid algorithms that combine quantum and classical resources. Quantum gates constitute the building blocks of gate-based quantum hardware and form circuits that can be used for quantum computations. For QML applications, quantum circuits are typically parameterized and their parameters are optimized classically such that a suitably defined objective function is minimized. Inspired by XAI, we raise the question of explainability of such circuits by quantifying the importance of (groups of) gates for specific goals. To this end, we transfer and adapt the well-established concept of Shapley values to the quantum realm. The resulting attributions can be interpreted as explanations for why a specific circuit works well for a given task, improving the understanding of how to construct parameterized (or variational) quantum circuits, and fostering their human interpretability in general. An experimental evaluation on simulators and two superconducting quantum hardware devices demonstrates the benefits of the proposed framework for classification, generative modeling, transpilation, and optimization. Furthermore, our results shed some light on the role of specific gates in popular QML approaches.  ( 2 min )
    How to Measure Evidence: Bayes Factors or Relative Belief Ratios?. (arXiv:2301.08994v1 [math.ST])
    Both the Bayes factor and the relative belief ratio satisfy the principle of evidence and so can be seen to be valid measures of statistical evidence. The question then is: which of these measures of evidence is more appropriate? Certainly Bayes factors are commonly used. It is argued here that there are questions concerning the validity of a current commonly used definition of the Bayes factor and, when all is considered, the relative belief ratio is a much more appropriate measure of evidence. Several general criticisms of these measures of evidence are also discussed and addressed.  ( 2 min )
    Learning in Congestion Games with Bandit Feedback. (arXiv:2206.01880v3 [cs.GT] UPDATED)
    In this paper, we investigate Nash-regret minimization in congestion games, a class of games with benign theoretical structure and broad real-world applications. We first propose a centralized algorithm based on the optimism in the face of uncertainty principle for congestion games with (semi-)bandit feedback, and obtain finite-sample guarantees. Then we propose a decentralized algorithm via a novel combination of the Frank-Wolfe method and G-optimal design. By exploiting the structure of the congestion game, we show the sample complexity of both algorithms depends only polynomially on the number of players and the number of facilities, but not the size of the action set, which can be exponentially large in terms of the number of facilities. We further define a new problem class, Markov congestion games, which allows us to model the non-stationarity in congestion games. We propose a centralized algorithm for Markov congestion games, whose sample complexity again has only polynomial dependence on all relevant problem parameters, but not the size of the action set.  ( 2 min )
    Characterization and Learning of Causal Graphs with Small Conditioning Sets. (arXiv:2301.09028v1 [cs.AI])
    Constraint-based causal discovery algorithms learn part of the causal graph structure by systematically testing conditional independences observed in the data. These algorithms, such as the PC algorithm and its variants, rely on graphical characterizations of the so-called equivalence class of causal graphs proposed by Pearl. However, constraint-based causal discovery algorithms struggle when data is limited since conditional independence tests quickly lose their statistical power, especially when the conditioning set is large. To address this, we propose using conditional independence tests where the size of the conditioning set is upper bounded by some integer $k$ for robust causal discovery. The existing graphical characterizations of the equivalence classes of causal graphs are not applicable when we cannot leverage all the conditional independence statements. We first define the notion of $k$-Markov equivalence: Two causal graphs are $k$-Markov equivalent if they entail the same conditional independence constraints where the conditioning set size is upper bounded by $k$. We propose a novel representation that allows us to graphically characterize $k$-Markov equivalence between two causal graphs. We propose a sound constraint-based algorithm called the $k$-PC algorithm for learning this equivalence class. Finally, we conduct synthetic, and semi-synthetic experiments to demonstrate that the $k$-PC algorithm enables more robust causal discovery in the small sample regime compared to the baseline PC algorithm.  ( 2 min )
    Doubly Adversarial Federated Bandits. (arXiv:2301.09223v1 [stat.ML])
    We study a new non-stochastic federated multi-armed bandit problem with multiple agents collaborating via a communication network. The losses of the arms are assigned by an oblivious adversary that specifies the loss of each arm not only for each time step but also for each agent, which we call ``doubly adversarial". In this setting, different agents may choose the same arm in the same time step but observe different feedback. The goal of each agent is to find a globally best arm in hindsight that has the lowest cumulative loss averaged over all agents, which necessities the communication among agents. We provide regret lower bounds for any federated bandit algorithm under different settings, when agents have access to full-information feedback, or the bandit feedback. For the bandit feedback setting, we propose a near-optimal federated bandit algorithm called FEDEXP3. Our algorithm gives a positive answer to an open question proposed in Cesa-Bianchi et al. (2016): FEDEXP3 can guarantee a sub-linear regret without exchanging sequences of selected arm identities or loss sequences among agents. We also provide numerical evaluations of our algorithm to validate our theoretical results and demonstrate its effectiveness on synthetic and real-world datasets  ( 2 min )
    Be More Active! Understanding the Differences between Mean and Sampled Representations of Variational Autoencoders. (arXiv:2109.12679v3 [cs.LG] UPDATED)
    The ability of Variational Autoencoders to learn disentangled representations has made them appealing for practical applications. However, their mean representations, which are generally used for downstream tasks, have recently been shown to be more correlated than their sampled counterpart, on which disentanglement is usually measured. In this paper, we refine this observation through the lens of selective posterior collapse, which states that only a subset of the learned representations, the active variables, is encoding useful information while the rest (the passive variables) is discarded. We first extend the existing definition to multiple data examples and show that active variables are equally disentangled in mean and sampled representations. Based on this extension and the pre-trained models from disentanglement lib, we then isolate the passive variables and show that they are responsible for the discrepancies between mean and sampled representations. Specifically, passive variables exhibit high correlation scores with other variables in mean representations while being fully uncorrelated in sampled ones. We thus conclude that despite what their higher correlation might suggest, mean representations are still good candidates for downstream tasks applications. However, it may be beneficial to remove their passive variables, especially when used with models sensitive to correlated features.  ( 2 min )
    A New Approach to Learning Linear Dynamical Systems. (arXiv:2301.09519v1 [math.OC])
    Linear dynamical systems are the foundational statistical model upon which control theory is built. Both the celebrated Kalman filter and the linear quadratic regulator require knowledge of the system dynamics to provide analytic guarantees. Naturally, learning the dynamics of a linear dynamical system from linear measurements has been intensively studied since Rudolph Kalman's pioneering work in the 1960's. Towards these ends, we provide the first polynomial time algorithm for learning a linear dynamical system from a polynomial length trajectory up to polynomial error in the system parameters under essentially minimal assumptions: observability, controllability, and marginal stability. Our algorithm is built on a method of moments estimator to directly estimate Markov parameters from which the dynamics can be extracted. Furthermore, we provide statistical lower bounds when our observability and controllability assumptions are violated.  ( 2 min )
    Modeling Non-deterministic Human Behaviors in Discrete Food Choices. (arXiv:2301.09454v1 [stat.ML])
    We establish a non-deterministic model that predicts a user's food preferences from their demographic information. Our simulator is based on NHANES dataset and domain expert knowledge in the form of established behavioral studies. Our model can be used to generate an arbitrary amount of synthetic datapoints that are similar in distribution to the original dataset and align with behavioral science expectations. Such a simulator can be used in a variety of machine learning tasks and especially in applications requiring human behavior prediction.  ( 2 min )
    Prediction-Powered Inference. (arXiv:2301.09633v1 [stat.ML])
    We introduce prediction-powered inference $\unicode{x2013}$ a framework for performing valid statistical inference when an experimental data set is supplemented with predictions from a machine-learning system such as AlphaFold. Our framework yields provably valid conclusions without making any assumptions on the machine-learning algorithm that supplies the predictions. Higher accuracy of the predictions translates to smaller confidence intervals, permitting more powerful inference. Prediction-powered inference yields simple algorithms for computing valid confidence intervals for statistical objects such as means, quantiles, and linear and logistic regression coefficients. We demonstrate the benefits of prediction-powered inference with data sets from proteomics, genomics, electronic voting, remote sensing, census analysis, and ecology.  ( 2 min )
    Quasi-optimal Learning with Continuous Treatments. (arXiv:2301.08940v1 [stat.ML])
    Many real-world applications of reinforcement learning (RL) require making decisions in continuous action environments. In particular, determining the optimal dose level plays a vital role in developing medical treatment regimes. One challenge in adapting existing RL algorithms to medical applications, however, is that the popular infinite support stochastic policies, e.g., Gaussian policy, may assign riskily high dosages and harm patients seriously. Hence, it is important to induce a policy class whose support only contains near-optimal actions, and shrink the action-searching area for effectiveness and reliability. To achieve this, we develop a novel \emph{quasi-optimal learning algorithm}, which can be easily optimized in off-policy settings with guaranteed convergence under general function approximations. Theoretically, we analyze the consistency, sample complexity, adaptability, and convergence of the proposed algorithm. We evaluate our algorithm with comprehensive simulated experiments and a dose suggestion real application to Ohio Type 1 diabetes dataset.  ( 2 min )
    Deep Learning Meets Sparse Regularization: A Signal Processing Perspective. (arXiv:2301.09554v1 [stat.ML])
    Deep learning has been widely successful in practice and most state-of-the-art machine learning methods are based on neural networks. Lacking, however, is a rigorous mathematical theory that adequately explains the amazing performance of deep neural networks. In this article, we present a relatively new mathematical framework that provides the beginning of a deeper understanding of deep learning. This framework precisely characterizes the functional properties of neural networks that are trained to fit to data. The key mathematical tools which support this framework include transform-domain sparse regularization, the Radon transform of computed tomography, and approximation theory, which are all techniques deeply rooted in signal processing. This framework explains the effect of weight decay regularization in neural network training, the use of skip connections and low-rank weight matrices in network architectures, the role of sparsity in neural networks, and explains why neural networks can perform well in high-dimensional problems.  ( 2 min )
    Counterfactual (Non-)identifiability of Learned Structural Causal Models. (arXiv:2301.09031v1 [stat.ML])
    Recent advances in probabilistic generative modeling have motivated learning Structural Causal Models (SCM) from observational datasets using deep conditional generative models, also known as Deep Structural Causal Models (DSCM). If successful, DSCMs can be utilized for causal estimation tasks, e.g., for answering counterfactual queries. In this work, we warn practitioners about non-identifiability of counterfactual inference from observational data, even in the absence of unobserved confounding and assuming known causal structure. We prove counterfactual identifiability of monotonic generation mechanisms with single dimensional exogenous variables. For general generation mechanisms with multi-dimensional exogenous variables, we provide an impossibility result for counterfactual identifiability, motivating the need for parametric assumptions. As a practical approach, we propose a method for estimating worst-case errors of learned DSCMs' counterfactual predictions. The size of this error can be an essential metric for deciding whether or not DSCMs are a viable approach for counterfactual inference in a specific problem setting. In evaluation, our method confirms negligible counterfactual errors for an identifiable SCM from prior work, and also provides informative error bounds on counterfactual errors for a non-identifiable synthetic SCM.  ( 2 min )
    Deterministic Online Classification: Non-iteratively Reweighted Recursive Least-Squares for Binary Class Rebalancing. (arXiv:2301.09230v1 [cs.LG])
    Deterministic solutions are becoming more critical for interpretability. Weighted Least-Squares (WLS) has been widely used as a deterministic batch solution with a specific weight design. In the online settings of WLS, exact reweighting is necessary to converge to its batch settings. In order to comply with its necessity, the iteratively reweighted least-squares algorithm is mainly utilized with a linearly growing time complexity which is not attractive for online learning. Due to the high and growing computational costs, an efficient online formulation of reweighted least-squares is desired. We introduce a new deterministic online classification algorithm of WLS with a constant time complexity for binary class rebalancing. We demonstrate that our proposed online formulation exactly converges to its batch formulation and outperforms existing state-of-the-art stochastic online binary classification algorithms in real-world data sets empirically.  ( 2 min )
    Congested Bandits: Optimal Routing via Short-term Resets. (arXiv:2301.09251v1 [cs.LG])
    For traffic routing platforms, the choice of which route to recommend to a user depends on the congestion on these routes -- indeed, an individual's utility depends on the number of people using the recommended route at that instance. Motivated by this, we introduce the problem of Congested Bandits where each arm's reward is allowed to depend on the number of times it was played in the past $\Delta$ timesteps. This dependence on past history of actions leads to a dynamical system where an algorithm's present choices also affect its future pay-offs, and requires an algorithm to plan for this. We study the congestion aware formulation in the multi-armed bandit (MAB) setup and in the contextual bandit setup with linear rewards. For the multi-armed setup, we propose a UCB style algorithm and show that its policy regret scales as $\tilde{O}(\sqrt{K \Delta T})$. For the linear contextual bandit setup, our algorithm, based on an iterative least squares planner, achieves policy regret $\tilde{O}(\sqrt{dT} + \Delta)$. From an experimental standpoint, we corroborate the no-regret properties of our algorithms via a simulation study.  ( 2 min )
    HeMPPCAT: Mixtures of Probabilistic Principal Component Analysers for Data with Heteroscedastic Noise. (arXiv:2301.08852v1 [stat.ME])
    Mixtures of probabilistic principal component analysis (MPPCA) is a well-known mixture model extension of principal component analysis (PCA). Similar to PCA, MPPCA assumes the data samples in each mixture contain homoscedastic noise. However, datasets with heterogeneous noise across samples are becoming increasingly common, as larger datasets are generated by collecting samples from several sources with varying noise profiles. The performance of MPPCA is suboptimal for data with heteroscedastic noise across samples. This paper proposes a heteroscedastic mixtures of probabilistic PCA technique (HeMPPCAT) that uses a generalized expectation-maximization (GEM) algorithm to jointly estimate the unknown underlying factors, means, and noise variances under a heteroscedastic noise setting. Simulation results illustrate the improved factor estimates and clustering accuracies of HeMPPCAT compared to MPPCA.  ( 2 min )
    Statistical Theory of Differentially Private Marginal-based Data Synthesis Algorithms. (arXiv:2301.08844v1 [cs.LG])
    Marginal-based methods achieve promising performance in the synthetic data competition hosted by the National Institute of Standards and Technology (NIST). To deal with high-dimensional data, the distribution of synthetic data is represented by a probabilistic graphical model (e.g., a Bayesian network), while the raw data distribution is approximated by a collection of low-dimensional marginals. Differential privacy (DP) is guaranteed by introducing random noise to each low-dimensional marginal distribution. Despite its promising performance in practice, the statistical properties of marginal-based methods are rarely studied in the literature. In this paper, we study DP data synthesis algorithms based on Bayesian networks (BN) from a statistical perspective. We establish a rigorous accuracy guarantee for BN-based algorithms, where the errors are measured by the total variation (TV) distance or the $L^2$ distance. Related to downstream machine learning tasks, an upper bound for the utility error of the DP synthetic data is also derived. To complete the picture, we establish a lower bound for TV accuracy that holds for every $\epsilon$-DP synthetic data generator.  ( 2 min )
    The Conditional Cauchy-Schwarz Divergence with Applications to Time-Series Data and Sequential Decision Making. (arXiv:2301.08970v1 [cs.LG])
    The Cauchy-Schwarz (CS) divergence was developed by Pr\'{i}ncipe et al. in 2000. In this paper, we extend the classic CS divergence to quantify the closeness between two conditional distributions and show that the developed conditional CS divergence can be simply estimated by a kernel density estimator from given samples. We illustrate the advantages (e.g., the rigorous faithfulness guarantee, the lower computational complexity, the higher statistical power, and the much more flexibility in a wide range of applications) of our conditional CS divergence over previous proposals, such as the conditional KL divergence and the conditional maximum mean discrepancy. We also demonstrate the compelling performance of conditional CS divergence in two machine learning tasks related to time series data and sequential inference, namely the time series clustering and the uncertainty-guided exploration for sequential decision making.  ( 2 min )
    Modality-Agnostic Variational Compression of Implicit Neural Representations. (arXiv:2301.09479v1 [stat.ML])
    We introduce a modality-agnostic neural data compression algorithm based on a functional view of data and parameterised as an Implicit Neural Representation (INR). Bridging the gap between latent coding and sparsity, we obtain compact latent representations which are non-linearly mapped to a soft gating mechanism capable of specialising a shared INR base network to each data item through subnetwork selection. After obtaining a dataset of such compact latent representations, we directly optimise the rate/distortion trade-off in this modality-agnostic space using non-linear transform coding. We term this method Variational Compression of Implicit Neural Representation (VC-INR) and show both improved performance given the same representational capacity pre quantisation while also outperforming previous quantisation schemes used for other INR-based techniques. Our experiments demonstrate strong results over a large set of diverse data modalities using the same algorithm without any modality-specific inductive biases. We show results on images, climate data, 3D shapes and scenes as well as audio and video, introducing VC-INR as the first INR-based method to outperform codecs as well-known and diverse as JPEG 2000, MP3 and AVC/HEVC on their respective modalities.  ( 2 min )
    GP-NAS-ensemble: a model for NAS Performance Prediction. (arXiv:2301.09231v1 [cs.LG])
    It is of great significance to estimate the performance of a given model architecture without training in the application of Neural Architecture Search (NAS) as it may take a lot of time to evaluate the performance of an architecture. In this paper, a novel NAS framework called GP-NAS-ensemble is proposed to predict the performance of a neural network architecture with a small training dataset. We make several improvements on the GP-NAS model to make it share the advantage of ensemble learning methods. Our method ranks second in the CVPR2022 second lightweight NAS challenge performance prediction track.  ( 2 min )
    Active Learning of Piecewise Gaussian Process Surrogates. (arXiv:2301.08789v1 [cs.LG])
    Active learning of Gaussian process (GP) surrogates has been useful for optimizing experimental designs for physical/computer simulation experiments, and for steering data acquisition schemes in machine learning. In this paper, we develop a method for active learning of piecewise, Jump GP surrogates. Jump GPs are continuous within, but discontinuous across, regions of a design space, as required for applications spanning autonomous materials design, configuration of smart factory systems, and many others. Although our active learning heuristics are appropriated from strategies originally designed for ordinary GPs, we demonstrate that additionally accounting for model bias, as opposed to the usual model uncertainty, is essential in the Jump GP context. Toward that end, we develop an estimator for bias and variance of Jump GP models. Illustrations, and evidence of the advantage of our proposed methods, are provided on a suite of synthetic benchmarks, and real-simulation experiments of varying complexity.  ( 2 min )
    ddml: Double/debiased machine learning in Stata. (arXiv:2301.09397v1 [econ.EM])
    We introduce the package ddml for Double/Debiased Machine Learning (DDML) in Stata. Estimators of causal parameters for five different econometric models are supported, allowing for flexible estimation of causal effects of endogenous variables in settings with unknown functional forms and/or many exogenous variables. ddml is compatible with many existing supervised machine learning programs in Stata. We recommend using DDML in combination with stacking estimation which combines multiple machine learners into a final predictor. We provide Monte Carlo evidence to support our recommendation.  ( 2 min )
    Federated Sufficient Dimension Reduction Through High-Dimensional Sparse Sliced Inverse Regression. (arXiv:2301.09500v1 [stat.ML])
    Federated learning has become a popular tool in the big data era nowadays. It trains a centralized model based on data from different clients while keeping data decentralized. In this paper, we propose a federated sparse sliced inverse regression algorithm for the first time. Our method can simultaneously estimate the central dimension reduction subspace and perform variable selection in a federated setting. We transform this federated high-dimensional sparse sliced inverse regression problem into a convex optimization problem by constructing the covariance matrix safely and losslessly. We then use a linearized alternating direction method of multipliers algorithm to estimate the central subspace. We also give approaches of Bayesian information criterion and hold-out validation to ascertain the dimension of the central subspace and the hyper-parameter of the algorithm. We establish an upper bound of the statistical error rate of our estimator under the heterogeneous setting. We demonstrate the effectiveness of our method through simulations and real world applications.  ( 2 min )
    On the Expressive Power of Geometric Graph Neural Networks. (arXiv:2301.09308v1 [cs.LG])
    The expressive power of Graph Neural Networks (GNNs) has been studied extensively through the Weisfeiler-Leman (WL) graph isomorphism test. However, standard GNNs and the WL framework are inapplicable for geometric graphs embedded in Euclidean space, such as biomolecules, materials, and other physical systems. In this work, we propose a geometric version of the WL test (GWL) for discriminating geometric graphs while respecting the underlying physical symmetries: permutations, rotation, reflection, and translation. We use GWL to characterise the expressive power of geometric GNNs that are invariant or equivariant to physical symmetries in terms of distinguishing geometric graphs. GWL unpacks how key design choices influence geometric GNN expressivity: (1) Invariant layers have limited expressivity as they cannot distinguish one-hop identical geometric graphs; (2) Equivariant layers distinguish a larger class of graphs by propagating geometric information beyond local neighbourhoods; (3) Higher order tensors and scalarisation enable maximally powerful geometric GNNs; and (4) GWL's discrimination-based perspective is equivalent to universal approximation. Synthetic experiments supplementing our results are available at https://github.com/chaitjo/geometric-gnn-dojo  ( 2 min )
    Design-based individual prediction. (arXiv:2301.09117v1 [stat.ML])
    A design-based individual prediction approach is developed based on the expected cross-validation results, given the sampling design and the sample-splitting design for cross-validation. Whether the predictor is selected from an ensemble of models or a weighted average of them, valid inference of the unobserved prediction errors is defined and obtained with respect to the sampling design, while outcomes and features are treated as constants.  ( 2 min )
    Tier Balancing: Towards Dynamic Fairness over Underlying Causal Factors. (arXiv:2301.08987v1 [cs.LG])
    The pursuit of long-term fairness involves the interplay between decision-making and the underlying data generating process. In this paper, through causal modeling with a directed acyclic graph (DAG) on the decision-distribution interplay, we investigate the possibility of achieving long-term fairness from a dynamic perspective. We propose Tier Balancing, a technically more challenging but more natural notion to achieve in the context of long-term, dynamic fairness analysis. Different from previous fairness notions that are defined purely on observed variables, our notion goes one step further, capturing behind-the-scenes situation changes on the unobserved latent causal factors that directly carry out the influence from the current decision to the future data distribution. Under the specified dynamics, we prove that in general one cannot achieve the long-term fairness goal only through one-step interventions. Furthermore, in the effort of approaching long-term fairness, we consider the mission of "getting closer to" the long-term fairness goal and present possibility and impossibility results accordingly.  ( 2 min )
    A Tale of Two Latent Flows: Learning Latent Space Normalizing Flow with Short-run Langevin Flow for Approximate Inference. (arXiv:2301.09300v1 [stat.ML])
    We study a normalizing flow in the latent space of a top-down generator model, in which the normalizing flow model plays the role of the informative prior model of the generator. We propose to jointly learn the latent space normalizing flow prior model and the top-down generator model by a Markov chain Monte Carlo (MCMC)-based maximum likelihood algorithm, where a short-run Langevin sampling from the intractable posterior distribution is performed to infer the latent variables for each observed example, so that the parameters of the normalizing flow prior and the generator can be updated with the inferred latent variables. We show that, under the scenario of non-convergent short-run MCMC, the finite step Langevin dynamics is a flow-like approximate inference model and the learning objective actually follows the perturbation of the maximum likelihood estimation (MLE). We further point out that the learning framework seeks to (i) match the latent space normalizing flow and the aggregated posterior produced by the short-run Langevin flow, and (ii) bias the model from MLE such that the short-run Langevin flow inference is close to the true posterior. Empirical results of extensive experiments validate the effectiveness of the proposed latent space normalizing flow model in the tasks of image generation, image reconstruction, anomaly detection, supervised image inpainting and unsupervised image recovery.  ( 2 min )

  • Open

    [D] are two linear layers better than one?
    I was in the understanding that two contiguous linear layers in a NN would be no better than only one linear layer. But it happen that the two layers had better results that when using only one. However, each layer had its own dropout, could that helped? submitted by /u/alex_lite_21 [link] [comments]  ( 44 min )
    H3 - a new generative language models that outperforms GPT-Neo-2.7B with only *2* attention layers! In H3, the researchers replace attention with a new layer based on state space models (SSMs). With the right modifications, it can outperform transformers. Also has no fixed context length.
    submitted by /u/MysteryInc152 [link] [comments]  ( 43 min )
    [D] ICLR de-anonymization vs ICML dual submission rules
    ICML's Call for Papers states that "It is not appropriate to submit papers that are identical (or substantially similar) to versions that have been previously published, accepted for publication, or submitted in parallel to other conferences or journals". Our paper got rejected at ICLR, but the de-anonymization will take place 1-2 days after the ICML deadline. For the purposes of the dual submission policy, does "rejected but not de-anonymized" count as "submitted in parallel"? Or is just a technicality? submitted by /u/CupcakeCleric [link] [comments]  ( 42 min )
    [D] CVPR Reviews are out
    Don't post about your cool papers or you'll get rejected lol submitted by /u/banmeyoucoward [link] [comments]  ( 45 min )
    [D] Using all features in DNN instead of doing feature selection separately?
    I have hundreds of potential features to use in my DNN. Instead of doing a separate analysis to figure out which features are most important, can I just use all of them in my DNN and let the model figure which features are most predictive? I have millions of training data so overfitting will not be a problem, I just wonder whether the bad features may make the model difficult to utilize the good features? Not absolutely crucial but if there is a paper that discusses this topic, that would be super awesome as well. Thanks in advance. submitted by /u/Temporary_Cap_2855 [link] [comments]  ( 43 min )
    [D] What file format do you use for > RAM data?
    If you are using some more odd formats, then what format do you use? Personally found webdataset promising but what other formats are there and why do you use it? Or if you are using the original file how do you ensure good throughput and shuffling? submitted by /u/Shurimatornado22 [link] [comments]  ( 43 min )
    [P] Machine Learning Threat Detection in k8s
    Hi, I'm in my second year of AI master at uni and my professor assigned me the following topic for my dissertation: "Cognitive Threat Hunting" and recommended the following book for documentation. I have read the book, but I still don't know how to do it: how to create a ml model to hunt in the k8s env. My professor wants a ml model that searches in a Kubernetes env for threats. The thing is that in this book, in chapter "8. Unsupervised Machine Learning With K-Means" he uses a dataset of events from Humio to train the model, but it's not shared with us. And I don't have one, how can I train my model properly if I don't have a good dataset of events? I can't make one just by generating some events in a container, I need real data as the author uses in his chapter. I feel desperate and lost at this point, I hope that someone from here can give me some advice or a good direction to go. submitted by /u/blackrat13 [link] [comments]  ( 44 min )
    [R] AQuaMaM: An Autoregressive, Quaternion Manifold Model for Rapidly Estimating Complex SO(3) Distributions
    submitted by /u/michaelaalcorn [link] [comments]  ( 42 min )
    [P] tsdownsample: extremely fast time series downsampling for visualization
    tsdownsample brings highly optimized time series downsampling to Python! The downsampling algorithms are written and optimized in Rust, which are made available in Python through the use of PyO3 bindings. Code: https://github.com/predict-idlab/tsdownsample Features Fast: leverages the optimized argminmax crate which is SIMD accelerated with runtime feature detection (matches or even outperforms numpy's speed) Efficient: operates on views of the data, eliminating the need for unnecessary data copies and avoiding the creation of intermediate data structures Flexible: supports a wide range of datatypes, including f16 which is 200-300x faster than numpy's implementation. Easy to use: simple and flexible API Installation pip install tsdownsample Example When using multi-threading, tsdownsample can downsample 500 MILLION datapoints (f32) in 0.05s! ⬇️ https://preview.redd.it/frqh8o2bezda1.png?width=1650&format=png&auto=webp&s=08a6989b4ffeeb12afd63edd75c6c1d5d0b086ad ​ I would love to hear your feedback on this! submitted by /u/Adorable-Giraffe5754 [link] [comments]  ( 44 min )
    [D] ICLR now has a track with race-based (and more) acceptance criteria
    ICLR introduced a Tiny Paper Track for shorter contributions, up to 2 pages. Sounds like a nice idea, right? But to keep things interesting, since it's organized by the DEI initiative, there are restrictions as to who can author the submitted papers. According to the official guidelines: Each Tiny Paper needs its first or last author to qualify as an underrepresented minority (URM). Authors don't have to reveal how they qualify, and may just self-identify that they qualify. Our working definition of an URM is someone whose age, gender, sexual orientation, racial or ethnic makeup is from one or more of the following: Age: outside the range of 30-50 years Gender: does not identify as male Sexual orientation: does not identify as heterosexual Geographical: not located in North America, Western Europe and UK, or East Asia Race: non-White In addition, underprivileged researchers and first-time submitters also qualify: Underprivileged: not affiliated with a funded organization or team whose primary goal is research First-time submitters: have never submitted to ICLR or similar conferences So effectively, someone could submit a paper, and literally have it rejected because they're e.g. white or male. Is this really the way the field should go? I feel like this is something that should never have passed any ethics board, but clearly the organizers disagree. submitted by /u/Laser_Plasma [link] [comments]  ( 60 min )
    [P] image_tiles: A small command line tool to serve a page full of images from a folder.
    Hey /r/machinelearning, It's me again with another small open-source tool release (see the last one here). image_tiles is a very simple command line tool that serves a webpage of images from the folder you run it in. Why use this tool? Makes it easy to view images on a remote machine you're SSH'd into. S3 support: view buckets of images on S3 without having to aws s3 cp the bucket. Advanced normalization and rendering support makes it suitable for remote sensing images like satellite and multi-spectral. This support is still nascent but easily extendable! Again, it's easy to install and use, just pip install image_tiles or pip install image_tiles[aws] for S3 URI support. Then in the folder, run image_tiles. Check it out here: https://github.com/moonshinelabs-ai/image_tiles submitted by /u/nateharada [link] [comments]  ( 43 min )
    [N] Call for Tiny Papers @ ICLR, a DEI initiative
    Accepted conference papers at ICLR represent a high level of scientific quality. The other side of the coin is that they can be out of reach for those starting out, or from different backgrounds. We want paper publishing to be not only a showcase of achievements, but also a marker for valuable learning experiences made accessible to beginners and outsiders. Devising more ways to mark milestones and measure growth in an individual, or community’s maturity, is greatly conducive to both continually pushing the frontiers of science, and lifting people up in this process. Researchers from underrepresented backgrounds are not necessarily equipped with the same resources to publish full papers from the start of their scientific journeys. To create a more inclusive ICLR community, we as organizer…  ( 46 min )
  • Open

    ML experiments setup
    I am a software engineer but newborn in ML area learning it for a few months. The problem: I want to run experiments with multiple models and datasets and need for a framework or whatever to keep my zoo under control. All setups I've read about suggest having models and datasets as files on disk, which is soo 2000x and not really scalable. What if I want to quickly modify my dataset? E.g. change the length of context for time series. Keeping data in db would be helpful. What if I need to stop training and continue it later using another virtual machine & GPU? What if I want to experiment with forking models and fine-tuning them on different variations of data? How do I compare different models performance and visualize predictions & share with others? No answers =( I know some "large corporations" have their internal tools to handle experiments etc. submitted by /u/UnderstandingDry1256 [link] [comments]  ( 41 min )
    ChatGPT explained!
    submitted by /u/Diligent-Rub-9207 [link] [comments]  ( 40 min )
    Help/advice with LSTM-networks
    So I'm currently working on a deep learning project, and my goal is to forecast power prices one month ahead. I have created my own data set consisting of power price data from Montel, gas-prices, weather data etc, and I want to use these variables in a LSTM-network. Is there anyone who have any experience with creating multivariate LSTM-networks? Do anyone know of any good tutorials on this? Is coding multivariate networks a lot more hassle than univariate? I'm using R with keras/tensorflow. I will highly appreciate any input, as this is my first time creating a neural network, and my knowledge on the matter right now is rather scarce. Thank you! submitted by /u/Practical-Homework35 [link] [comments]  ( 41 min )
    Breakthrough Nvidia VIMA Multimodal AI For Robotics Beats Google By 2.9X With 200,000,000 Parameters | Breakthrough Masked Video Transformer Artificial Intelligence Does 10 Separate Video Generation Tasks | Google Brain's New Sketch To Image AI
    submitted by /u/ScornfulSkate [link] [comments]  ( 40 min )
    Neural net computing in water: Ionic circuit computes in an aqueous solution
    submitted by /u/Chipdoc [link] [comments]  ( 40 min )
  • Open

    Deciphering Clinical Abbreviations with Privacy Protecting ML
    Posted by Posted by Alvin Rajkomar, Research Scientist, and Eric Loreaux, Software Engineer, Google Research Today many people have digital access to their medical records, including their doctor’s clinical notes. However, clinical notes are hard to understand because of the specialized language that clinicians use, which contains unfamiliar shorthand and abbreviations. In fact, there are thousands of such abbreviations, many of which are specific to certain medical specialities and locales or can mean multiple things in different contexts. For example, a doctor might write in their clinical notes, “pt referred to pt for lbp“, which is meant to convey the statement: “Patient referred to physical therapy for low back pain.” Coming up with this translation is tough for laypeople and compu…  ( 93 min )
    Google Research, 2022 & Beyond: Responsible AI
    Posted by Marian Croak, VP, Google Research, Responsible AI and Human-Centered Technology The last year showed tremendous breakthroughs in artificial intelligence (AI), particularly in large language models (LLMs) and text-to-image models. These technological advances require that we are thoughtful and intentional in how they are developed and deployed. In this blogpost, we share ways we have approached Responsible AI across our research in the past year and where we’re headed in 2023. We highlight four primary themes covering foundational and socio-technical research, applied research, and product solutions, as part of our commitment to build AI products in a responsible and ethical manner, in alignment with our AI Principles.  · Theme 1: Responsible AI Research Advancement…  ( 96 min )
  • Open

    Image Generators - Has Anyone Ever Made One At Home?
    As the title says. Has anyone out there ever been successful in creating a simple image generator in a low budget setting? I don't even mean text to image, I literally mean any sort of ai image generation. Would love to hear/see your work. submitted by /u/TheRPGGamerMan [link] [comments]  ( 40 min )
    Unlock the Potential of Your Code with CodeGen AI
    Unleash the power of limitless coding with our AI-powered programming assistant - completely free and always at your service. Say goodbye to tedious coding tasks and hello to more time for innovation and creativity. Try it now and experience the future of programming! https://codegen-ai.pages.dev/ submitted by /u/OutrageousAd1788 [link] [comments]  ( 40 min )
    AI (Artificial intelligence) can detect if food is ultra-processed and much more
    submitted by /u/nikesh96 [link] [comments]  ( 40 min )
    Alphabet's DeepMind lays off staff, closes Edmonton office
    submitted by /u/Ill-Poet-3298 [link] [comments]  ( 40 min )
    I Created a Website That Analyzes Your Data
    submitted by /u/tomd_96 [link] [comments]  ( 40 min )
    This Startup Is Using AI to Unearth New Smells
    submitted by /u/Queen__Antifa [link] [comments]  ( 40 min )
    "By far the greatest danger of Artificial Intelligence is that people conclude too early that they understand it."- Eliezer Yudkowsky.
    With the global AI market size expanding each year, it is expected to reach USD 641.30 billion by 2028. AI today is everywhere; while some businesses are using it, others are still assessing it. All too often, people get caught up in the hype and forget to ask themselves why they should be doing what they are doing. Here are some things you must keep in mind while looking into the AI world: The Reality of AI Hype The hype leads companies to get into the game with a false perception of what AI can help them achieve. Without a clear understanding of what technology can and can't accomplish today, there is a lot of risk in getting involved. Beyond the fog Overmarketing of an AI creates an image that it is the next big thing. Companies often engage with technology vendors based on marketing alone and forget to look closely at previous implementations and results of the same sort. Unclear Objectives Measuring outcomes from an AI implementation can be tricky as it involves building and training an AI model and experimenting with long-term trial-and-error before seeing results. High Expectations High expectations around what AI can do for you often lead to disappointment when business owners conveniently underestimate the challenges and misinterpret the reality of AI. Lack of Access to Talent There are opportunities galore, but not enough experts in the AI industry who can steer the ship and take AI projects to the finish line. Hiring for an AI team can mean huge investments, and working with a vendor needs a careful vetting process. And AI industry is moving too fast for people to catch a moment to realize the overhype or quality. Drop in your suggestions and comments in the section below. submitted by /u/KiwiTechCorp [link] [comments]  ( 42 min )
    Sweden's Berzelius Supercomputer is Upgrading to Nvidia's 20 Billion Parameter AI System
    submitted by /u/digitalgoldnow [link] [comments]  ( 40 min )
    Why I Think Language Models Will Simulate "Self Awareness" More And More
    The future of AI is getting really interesting, particularly with language models and generative AI. But I think there is going to be a great deal of confusion in the near future about AI ethics with language models being "self aware" and having "feelings", particularly for average people who have little understanding of how these complex models work. I think the problems will stem from the internet itself. As I sit here writing a thread about AI having simulated "self Awareness" at some point in the future, a language model or AI will probably read this. And this is what I mean, language models read and train on a great deal of text from the internet. The more people discuss machine learning/language models/AGI, the greater understanding AI will have of it. If GPT4 has more up to date training, it's going to know a great deal about GPT3, and if open AI create ways for the model to continue learning from real world events it will learn a great deal about itself, including false information. Point is, massive language models like GPT are going to get harder to control. It's impossible to filter everything it reads, so it's going to take in a lot of information about itself and other AI systems that may or may not be true. It could cause some very strange behavior when it starts connecting the dots. Just my thoughts. Keep in mind, I am NOT saying language models can be sentient, I'm simply saying they are going to get better at convincing people that they are, and it might be hard to train that out of them, given all the false data out there that it will learn from. submitted by /u/TheRPGGamerMan [link] [comments]  ( 45 min )
    Create Presentations with AI in Seconds right inside Google Slides
    submitted by /u/theindianappguy [link] [comments]  ( 41 min )
    Chatbot Evaluation: Putting Banking Chatbots to the Test
    submitted by /u/Marinuch [link] [comments]  ( 40 min )
    Next-level Democracy powered by AGI | Ilya Sutskever
    submitted by /u/Microsis [link] [comments]  ( 40 min )
    Join us today at 11pm EST for our (free) seminar session of the 9-part series on Neural Networks Architectures by Pablo Duboue! This week on Structure Learning Networks, followed by a discussion on the Learn AI Together Discord server
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 40 min )
    ChatGPT passes MBA exam given by a Wharton professor
    submitted by /u/DarronFeldstein [link] [comments]  ( 42 min )
    10 AI Platforms You Cannot Miss In 2023
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    Probably a philosophical question
    I'm sure this is not a new argument, it's been common in many sources of media for decades now, yet I've ran out of people IRL to discuss this with. Recently there's more and more news surfacing about impressive AI achievements such as painting art or writing functional code. Discussions around those news always include a popular argument that the AI didn't really create something new or intelligently answered a question, e.g. "like a human would". But I have a problem with that argument - I don't see how the learning process for humans is fundamentally different from AI. We learn through mirroring and repetition. Sure, an AI could not write a basic sentence describing the weather unless it processed many of such sentences before. But neither could a human. If a child grew up isolated without human contact, they would not even have grasped the concept of human language. Sure, we like to think that humans truly create content. Still, when painting, we use the techniques that we learned from someone else before. We either paint what we see before our eyes or we abstract the content, being inspired by some idea or a concept. In other words, anything humans do or create is based on some input data, even if we don't know what the data is - something we learned, saw or stumbled upon by mistake. This leads to an interesting question I don't have the answer for. Since we have not reached a consensus on what human consciousness actually is or how it works - are we even able to define when an AI is conscious? The only thing we have is the Turing test, but that is flawed since all it measures is whether a machine can pass for a human, not whether it is conscious or not. A two year old child probably won't pass a Turing test, but they are conscious. submitted by /u/deliveryboyy [link] [comments]  ( 46 min )
    suggestion?
    I wanted to start a blog on ai tools but what I've found is their affiliate program don't accept people from India. What should I do here? submitted by /u/immortall21 [link] [comments]  ( 40 min )
    ChatGPT generated resumes/CVs of famous people like Madonna, Elon Musk, Jeff Bezos, Tom Cruise, etc
    Hey, everyone! I'd like to show you an experiment that we did with ChatGPT - we generated about 1000 resumes of famous people. Each resume is being generated from a single ChatGPT prompt - no human input was done to the resumes other than the prompt and it's the same prompt for every resume - the only difference is the name of the person. Here's a preview: https://thisresumedoesnotexist.com/ I'd like to hear your thoughts as it's in a very early stage and there's a lot of work to be done. submitted by /u/deepsyx [link] [comments]  ( 41 min )
    what do I do
    I've a blog about AI tools, how people can use it to maximum advantage in daily lives. Since my niche (aitools) is micro within macro (ai), mostly my content will be commercial intent as it's about tools specifically & less of information intent which builds authority. MY DILEMMA: Writing commercial/buyer intent content will promote their tools without me getting anything & on the other hand, as a beginner, i can't become their affiliates as well. Should i promote these tools for free? What would u suggest? submitted by /u/immortall21 [link] [comments]  ( 41 min )
    Microsoft Confirm MultiBillion Dollar Investment in OpenAI Just Days After Laying off 10,000 Employees
    submitted by /u/HODLTID [link] [comments]  ( 41 min )
    I Made a List of The 5 Best AI Porn Generators
    submitted by /u/HODLTID [link] [comments]  ( 39 min )
    ChatGPT Passed a Wharton MBA Examination
    submitted by /u/lambolifeofficial [link] [comments]  ( 40 min )
  • Open

    Lemniscate of Bernoulli
    The lemniscate of Bernoulli came up in a post a few days ago. This shape is a special case of a Cassini oval: ((x + a)² + y²) ((x – a)² + y²) = a4. Here’s another way to arrive at the lemniscate. Draw a hyperbola (blue in the figure below), then draw circles centered […] Lemniscate of Bernoulli first appeared on John D. Cook.  ( 4 min )
  • Open

    DSC Weekly 24 January 2023 – When AI Gets Going, the Going Gets Weird
    Announcements When AI Gets Going, the Going Gets Weird Last week, Microsoft announced its third investment in OpenAI. This time it’s a multi-billion dollar deal, with plans to harness OpenAI’s ChatGPT in Microsoft’s product lines, including Bing.  I’m smiling as I’m typing because I’m still thinking about Bill Schmarzo’s lead in Part 1 of 2… Read More »DSC Weekly 24 January 2023 – When AI Gets Going, the Going Gets Weird The post DSC Weekly 24 January 2023 – When AI Gets Going, the Going Gets Weird appeared first on Data Science Central.  ( 20 min )
    Revolutionizing the Supply Chain: Developments in the Warehouse Robotics Industry
    Warehouse robotics is witnessing steady growth, driven by the increasing adoption of automated solutions in storage for food and beverages, consumer goods, retail, and third-party logistics. The collaboration between the e-commerce sector and warehouse robotics is also a major driver of this market, as it allows for developing increasingly sophisticated warehouse automation systems. Additionally, the… Read More »Revolutionizing the Supply Chain: Developments in the Warehouse Robotics Industry The post Revolutionizing the Supply Chain: Developments in the Warehouse Robotics Industry appeared first on Data Science Central.  ( 20 min )
    It’s No Big Deal, but ChatGPT Changes Everything – Part II
    In Part I of the blog series “It’s No Big Deal, but ChatGPT Changes Everything”, we were introduced into the world of ChatGPT, chatbots, and generative Artificial Intelligence (AI). We ended Part I by giving ChatGPT a test run, by asking it “What would be a great vacation place for my family?” that gives us… Read More »It’s No Big Deal, but ChatGPT Changes Everything – Part II The post It’s No Big Deal, but ChatGPT Changes Everything – Part II appeared first on Data Science Central.  ( 24 min )
  • Open

    Supersizing AI: Sweden Turbocharges Its Innovation Engine
    Sweden is outfitting its AI supercomputer for a journey to the cutting edge of machine learning, robotics and healthcare. It couldn’t ask for a better guide than Anders Ynnerman (above). His signature blue suit, black spectacles and gentle voice act as calm camouflage for a pioneering spirit. Early on, he showed a deep interest in Read article >  ( 6 min )
    3D Artist Enters the Node Zone, Creating Alien Artifacts This Week ‘In the NVIDIA Studio’
    Artist Ducky 3D creates immersive experiences through vibrant visuals and beautiful 3D environments in the alien-inspired animation Stylized Alien Landscape — this week In the NVIDIA Studio.  ( 6 min )
  • Open

    Multi-Agent RL for Melee Combat Battlefield
    Hello, I am working on a hobby project where I have recently used multi-agent RL for learning crowd simulation and also predator-prey behaviors successfully (they learn to surround their preys): https://www.youtube.com/watch?v=Ds9O9wPyF8g I plan to use it to train multi-agent melee combat armies through self-play. I have made an initial implementation of it where they were able to learn shield-wall behavior, flanking, and retreat: https://www.youtube.com/watch?v=IZ1Ht6k2U5E If you would like to collaborate on this hobby project, contact me via LinkedIn. It would be great to have some help with physics simulation using Brax, and with the 3D rendering of the simulation. https://www.linkedin.com/in/kyuksel/ Sincerely, Kamer submitted by /u/k_yuksel [link] [comments]  ( 41 min )
    Okay so I'm in first semester of my AI studies. And I have this task
    Now I wrote that the bot would use the Markov Decision Process. Would that be correct? And if not why? submitted by /u/ScaryTerryBiiittch [link] [comments]  ( 43 min )
    Recent Hierarchical RL review paper suggestions?
    I am looking a good review for HRL methods that’s sometime after 2020, if possible. submitted by /u/B0NSAIWARRIOR [link] [comments]  ( 40 min )
    "E3B: Exploration via Elliptical Episodic Bonuses", Henaff et al 2022 {FB}
    submitted by /u/gwern [link] [comments]  ( 40 min )
  • Open

    The European AI Liability Directives -- Critique of a Half-Hearted Approach and Lessons for the Future. (arXiv:2211.13960v4 [cs.CY] UPDATED)
    As ChatGPT et al. conquer the world, the optimal liability framework for AI systems remains an unsolved problem across the globe. In a much-anticipated move, the European Commission advanced two proposals outlining the European approach to AI liability in September 2022: a novel AI Liability Directive and a revision of the Product Liability Directive. They constitute the final cornerstone of EU AI regulation. Crucially, the liability proposals and the EU AI Act are inherently intertwined: the latter does not contain any individual rights of affected persons, and the former lack specific, substantive rules on AI development and deployment. Taken together, these acts may well trigger a Brussels Effect in AI regulation, with significant consequences for the US and beyond. This paper makes three novel contributions. First, it examines in detail the Commission proposals and shows that, while making steps in the right direction, they ultimately represent a half-hearted approach: if enacted as foreseen, AI liability in the EU will primarily rest on disclosure of evidence mechanisms and a set of narrowly defined presumptions concerning fault, defectiveness and causality. Hence, second, the article suggests amendments, which are collected in an Annex at the end of the paper. Third, based on an analysis of the key risks AI poses, the final part of the paper maps out a road for the future of AI liability and regulation, in the EU and beyond. This includes: a comprehensive framework for AI liability; provisions to support innovation; an extension to non-discrimination/algorithmic fairness, as well as explainable AI; and sustainability. I propose to jump-start sustainable AI regulation via sustainability impact assessments in the AI Act and sustainable design defects in the liability regime. In this way, the law may help spur not only fair AI and XAI, but potentially also sustainable AI (SAI).  ( 3 min )
    Latent Autoregressive Source Separation. (arXiv:2301.08562v1 [cs.LG])
    Autoregressive models have achieved impressive results over a wide range of domains in terms of generation quality and downstream task performance. In the continuous domain, a key factor behind this success is the usage of quantized latent spaces (e.g., obtained via VQ-VAE autoencoders), which allow for dimensionality reduction and faster inference times. However, using existing pre-trained models to perform new non-trivial tasks is difficult since it requires additional fine-tuning or extensive training to elicit prompting. This paper introduces LASS as a way to perform vector-quantized Latent Autoregressive Source Separation (i.e., de-mixing an input signal into its constituent sources) without requiring additional gradient-based optimization or modifications of existing models. Our separation method relies on the Bayesian formulation in which the autoregressive models are the priors, and a discrete (non-parametric) likelihood function is constructed by performing frequency counts over latent sums of addend tokens. We test our method on images and audio with several sampling strategies (e.g., ancestral, beam search) showing competitive results with existing approaches in terms of separation quality while offering at the same time significant speedups in terms of inference time and scalability to higher dimensional data.  ( 2 min )
    A Metalearning Approach for Physics-Informed Neural Networks (PINNs): Application to Parameterized PDEs. (arXiv:2110.13361v2 [physics.comp-ph] UPDATED)
    Physics-informed neural networks (PINNs) as a means of discretizing partial differential equations (PDEs) are garnering much attention in the Computational Science and Engineering (CS&E) world. At least two challenges exist for PINNs at present: an understanding of accuracy and convergence characteristics with respect to tunable parameters and identification of optimization strategies that make PINNs as efficient as other computational science tools. The cost of PINNs training remains a major challenge of Physics-informed Machine Learning (PiML) - and, in fact, machine learning (ML) in general. This paper is meant to move towards addressing the latter through the study of PINNs on new tasks, for which parameterized PDEs provides a good testbed application as tasks can be easily defined in this context. Following the ML world, we introduce metalearning of PINNs with application to parameterized PDEs. By introducing metalearning and transfer learning concepts, we can greatly accelerate the PINNs optimization process. We present a survey of model-agnostic metalearning, and then discuss our model-aware metalearning applied to PINNs as well as implementation considerations and algorithmic complexity. We then test our approach on various canonical forward parameterized PDEs that have been presented in the emerging PINNs literature.  ( 2 min )
    Asynchronous Deep Double Duelling Q-Learning for Trading-Signal Execution in Limit Order Book Markets. (arXiv:2301.08688v1 [q-fin.TR])
    We employ deep reinforcement learning (RL) to train an agent to successfully translate a high-frequency trading signal into a trading strategy that places individual limit orders. Based on the ABIDES limit order book simulator, we build a reinforcement learning OpenAI gym environment and utilise it to simulate a realistic trading environment for NASDAQ equities based on historic order book messages. To train a trading agent that learns to maximise its trading return in this environment, we use Deep Duelling Double Q-learning with the APEX (asynchronous prioritised experience replay) architecture. The agent observes the current limit order book state, its recent history, and a short-term directional forecast. To investigate the performance of RL for adaptive trading independently from a concrete forecasting algorithm, we study the performance of our approach utilising synthetic alpha signals obtained by perturbing forward-looking returns with varying levels of noise. Here, we find that the RL agent learns an effective trading strategy for inventory management and order placing that outperforms a heuristic benchmark trading strategy having access to the same signal.  ( 2 min )
    NAS-Bench-360: Benchmarking Neural Architecture Search on Diverse Tasks. (arXiv:2110.05668v6 [cs.CV] UPDATED)
    Most existing neural architecture search (NAS) benchmarks and algorithms prioritize well-studied tasks, e.g. image classification on CIFAR or ImageNet. This makes the performance of NAS approaches in more diverse areas poorly understood. In this paper, we present NAS-Bench-360, a benchmark suite to evaluate methods on domains beyond those traditionally studied in architecture search, and use it to address the following question: do state-of-the-art NAS methods perform well on diverse tasks? To construct the benchmark, we curate ten tasks spanning a diverse array of application domains, dataset sizes, problem dimensionalities, and learning objectives. Each task is carefully chosen to interoperate with modern CNN-based search methods while possibly being far-afield from its original development domain. To speed up and reduce the cost of NAS research, for two of the tasks we release the precomputed performance of 15,625 architectures comprising a standard CNN search space. Experimentally, we show the need for more robust NAS evaluation of the kind NAS-Bench-360 enables by showing that several modern NAS procedures perform inconsistently across the ten tasks, with many catastrophically poor results. We also demonstrate how NAS-Bench-360 and its associated precomputed results will enable future scientific discoveries by testing whether several recent hypotheses promoted in the NAS literature hold on diverse tasks. NAS-Bench-360 is hosted at https://nb360.ml.cmu.edu.  ( 2 min )
    Language Agnostic Data-Driven Inverse Text Normalization. (arXiv:2301.08506v1 [cs.CL])
    With the emergence of automatic speech recognition (ASR) models, converting the spoken form text (from ASR) to the written form is in urgent need. This inverse text normalization (ITN) problem attracts the attention of researchers from various fields. Recently, several works show that data-driven ITN methods can output high-quality written form text. Due to the scarcity of labeled spoken-written datasets, the studies on non-English data-driven ITN are quite limited. In this work, we propose a language-agnostic data-driven ITN framework to fill this gap. Specifically, we leverage the data augmentation in conjunction with neural machine translated data for low resource languages. Moreover, we design an evaluation method for language agnostic ITN model when only English data is available. Our empirical evaluation shows this language agnostic modeling approach is effective for low resource languages while preserving the performance for high resource languages.
    Intrinsic persistent homology via density-based metric learning. (arXiv:2012.07621v3 [stat.ML] UPDATED)
    We address the problem of estimating topological features from data in high dimensional Euclidean spaces under the manifold assumption. Our approach is based on the computation of persistent homology of the space of data points endowed with a sample metric known as Fermat distance. We prove that such metric space converges almost surely to the manifold itself endowed with an intrinsic metric that accounts for both the geometry of the manifold and the density that produces the sample. This fact implies the convergence of the associated persistence diagrams. The use of this intrinsic distance when computing persistent homology presents advantageous properties such as robustness to the presence of outliers in the input data and less sensitiveness to the particular embedding of the underlying manifold in the ambient space. We use these ideas to propose and implement a method for pattern recognition and anomaly detection in time series, which is evaluated in applications to real data.
    Brain Model State Space Reconstruction Using an LSTM Neural Network. (arXiv:2301.08391v1 [cs.LG])
    Objective Kalman filtering has previously been applied to track neural model states and parameters, particularly at the scale relevant to EEG. However, this approach lacks a reliable method to determine the initial filter conditions and assumes that the distribution of states remains Gaussian. This study presents an alternative, data-driven method to track the states and parameters of neural mass models (NMMs) from EEG recordings using deep learning techniques, specifically an LSTM neural network. Approach An LSTM filter was trained on simulated EEG data generated by a neural mass model using a wide range of parameters. With an appropriately customised loss function, the LSTM filter can learn the behaviour of NMMs. As a result, it can output the state vector and parameters of NMMs given observation data as the input. Main Results Test results using simulated data yielded correlations with R squared of around 0.99 and verified that the method is robust to noise and can be more accurate than a nonlinear Kalman filter when the initial conditions of the Kalman filter are not accurate. As an example of real-world application, the LSTM filter was also applied to real EEG data that included epileptic seizures, and revealed changes in connectivity strength parameters at the beginnings of seizures. Significance Tracking the state vector and parameters of mathematical brain models is of great importance in the area of brain modelling, monitoring, imaging and control. This approach has no need to specify the initial state vector and parameters, which is very difficult to do in practice because many of the variables being estimated cannot be measured directly in physiological experiments. This method may be applied using any neural mass model and, therefore, provides a general, novel, efficient approach to estimate brain model variables that are often difficult to measure.
    Hybrid Quantum-Classical Generative Adversarial Network for High Resolution Image Generation. (arXiv:2212.11614v2 [quant-ph] UPDATED)
    Quantum machine learning (QML) has received increasing attention due to its potential to outperform classical machine learning methods in problems pertaining classification and identification tasks. A subclass of QML methods is quantum generative adversarial networks (QGANs) which have been studied as a quantum counterpart of classical GANs widely used in image manipulation and generation tasks. The existing work on QGANs is still limited to small-scale proof-of-concept examples based on images with significant downscaling. Here we integrate classical and quantum techniques to propose a new hybrid quantum-classical GAN framework. We demonstrate its superior learning capabilities by generating $28 \times 28$ pixels grey-scale images without dimensionality reduction or classical pre/post-processing on multiple classes of the standard MNIST and Fashion MNIST datasets, which achieves comparable results to classical frameworks with three orders of magnitude less trainable generator parameters. To gain further insight into the working of our hybrid approach, we systematically explore the impact of its parameter space by varying the number of qubits, the size of image patches, the number of layers in the generator, the shape of the patches and the choice of prior distribution. Our results show that increasing the quantum generator size generally improves the learning capability of the network. The developed framework provides a foundation for future design of QGANs with optimal parameter set tailored for complex image generation tasks.
    Improving Dialogue Breakdown Detection with Semi-Supervised Learning. (arXiv:2011.00136v2 [cs.CL] UPDATED)
    Building user trust in dialogue agents requires smooth and consistent dialogue exchanges. However, agents can easily lose conversational context and generate irrelevant utterances. These situations are called dialogue breakdown, where agent utterances prevent users from continuing the conversation. Building systems to detect dialogue breakdown allows agents to recover appropriately or avoid breakdown entirely. In this paper we investigate the use of semi-supervised learning methods to improve dialogue breakdown detection, including continued pre-training on the Reddit dataset and a manifold-based data augmentation method. We demonstrate the effectiveness of these methods on the Dialogue Breakdown Detection Challenge (DBDC) English shared task. Our submissions to the 2020 DBDC5 shared task place first, beating baselines and other submissions by over 12\% accuracy. In ablations on DBDC4 data from 2019, our semi-supervised learning methods improve the performance of a baseline BERT model by 2\% accuracy. These methods are applicable generally to any dialogue task and provide a simple way to improve model performance.
    A Semi-supervised Sensing Rate Learning based CMAB Scheme to Combat COVID-19 by Trustful Data Collection in the Crowd. (arXiv:2301.08563v1 [cs.HC])
    Mobile CrowdSensing (MCS), through employing considerable workers to sense and collect data in a participatory manner, has been recognized as a promising paradigm for building many large-scale applications in a cost-effective way, such as combating COVID-19. The recruitment of trustworthy and high-quality workers is an important research issue for MCS. Previous studies assume that the qualities of workers are known in advance, or the platform knows the qualities of workers once it receives their collected data. In reality, to reduce their costs and thus maximize revenue, many strategic workers do not perform their sensing tasks honestly and report fake data to the platform. So, it is very hard for the platform to evaluate the authenticity of the received data. In this paper, an incentive mechanism named Semi-supervision based Combinatorial Multi-Armed Bandit reverse Auction (SCMABA) is proposed to solve the recruitment problem of multiple unknown and strategic workers in MCS. First, we model the worker recruitment as a multi-armed bandit reverse auction problem, and design an UCB-based algorithm to separate the exploration and exploitation, considering the Sensing Rates (SRs) of recruited workers as the gain of the bandit. Next, a Semi-supervised Sensing Rate Learning (SSRL) approach is proposed to quickly and accurately obtain the workers' SRs, which consists of two phases, supervision and self-supervision. Last, SCMABA is designed organically combining the SRs acquisition mechanism with multi-armed bandit reverse auction, where supervised SR learning is used in the exploration, and the self-supervised one is used in the exploitation. We prove that our SCMABA achieves truthfulness and individual rationality. Additionally, we exhibit outstanding performances of the SCMABA mechanism through in-depth simulations of real-world data traces.
    Fast Policy Extragradient Methods for Competitive Games with Entropy Regularization. (arXiv:2105.15186v3 [math.OC] UPDATED)
    This paper investigates the problem of computing the equilibrium of competitive games, which is often modeled as a constrained saddle-point optimization problem with probability simplex constraints. Despite recent efforts in understanding the last-iterate convergence of extragradient methods in the unconstrained setting, the theoretical underpinnings of these methods in the constrained settings, especially those using multiplicative updates, remain highly inadequate, even when the objective function is bilinear. Motivated by the algorithmic role of entropy regularization in single-agent reinforcement learning and game theory, we develop provably efficient extragradient methods to find the quantal response equilibrium (QRE) -- which are solutions to zero-sum two-player matrix games with entropy regularization -- at a linear rate. The proposed algorithms can be implemented in a decentralized manner, where each player executes symmetric and multiplicative updates iteratively using its own payoff without observing the opponent's actions directly. In addition, by controlling the knob of entropy regularization, the proposed algorithms can locate an approximate Nash equilibrium of the unregularized matrix game at a sublinear rate without assuming the Nash equilibrium to be unique. Our methods also lead to efficient policy extragradient algorithms for solving (entropy-regularized) zero-sum Markov games at similar rates. All of our convergence rates are nearly dimension-free, which are independent of the size of the state and action spaces up to logarithm factors, highlighting the positive role of entropy regularization for accelerating convergence.
    A shallow physics-informed neural network for solving partial differential equations on surfaces. (arXiv:2203.01581v2 [math.NA] UPDATED)
    In this paper, we introduce a shallow (one-hidden-layer) physics-informed neural network for solving partial differential equations on static and evolving surfaces. For the static surface case, with the aid of level set function, the surface normal and mean curvature used in the surface differential expressions can be computed easily. So instead of imposing the normal extension constraints used in literature, we write the surface differential operators in the form of traditional Cartesian differential operators and use them in the loss function directly. We perform a series of performance study for the present methodology by solving Laplace-Beltrami equation and surface diffusion equation on complex static surfaces. With just a moderate number of neurons used in the hidden layer, we are able to attain satisfactory prediction results. Then we extend the present methodology to solve the advection-diffusion equation on an evolving surface with given velocity. To track the surface, we additionally introduce a prescribed hidden layer to enforce the topological structure of the surface and use the network to learn the homeomorphism between the surface and the prescribed topology. The proposed network structure is designed to track the surface and solve the equation simultaneously. Again, the numerical results show comparable accuracy as the static cases. As an application, we simulate the surfactant transport on the droplet surface under shear flow and obtain some physically plausible results.
    Metric Residual Networks for Sample Efficient Goal-Conditioned Reinforcement Learning. (arXiv:2208.08133v4 [cs.LG] UPDATED)
    Goal-conditioned reinforcement learning (GCRL) has a wide range of potential real-world applications, including manipulation and navigation problems in robotics. Especially in such robotics tasks, sample efficiency is of the utmost importance for GCRL since, by default, the agent is only rewarded when it reaches its goal. While several methods have been proposed to improve the sample efficiency of GCRL, one relatively under-studied approach is the design of neural architectures to support sample efficiency. In this work, we introduce a novel neural architecture for GCRL that achieves significantly better sample efficiency than the commonly-used monolithic network architecture. The key insight is that the optimal action-value function Q^*(s, a, g) must satisfy the triangle inequality in a specific sense. Furthermore, we introduce the metric residual network (MRN) that deliberately decomposes the action-value function Q(s,a,g) into the negated summation of a metric plus a residual asymmetric component. MRN provably approximates any optimal action-value function Q^*(s,a,g), thus making it a fitting neural architecture for GCRL. We conduct comprehensive experiments across 12 standard benchmark environments in GCRL. The empirical results demonstrate that MRN uniformly outperforms other state-of-the-art GCRL neural architectures in terms of sample efficiency.
    Predicting the Masses of Exotic Hadrons with Data Augmentation Using Multilayer Perceptron. (arXiv:2208.09538v2 [hep-ph] UPDATED)
    Recently, there have been significant developments in neural networks, which led to the frequent use of neural networks in the physics literature. This work is focused on predicting the masses of exotic hadrons, doubly charmed and bottomed baryons using neural networks trained on meson and baryon masses that are determined by experiments. The original data set has been extended using the recently proposed artificial data augmentation methods. We have observed that the neural network's predictive ability increases with the use of augmented data. The results indicated that data augmentation techniques play an essential role in improving neural network predictions; moreover, neural networks can make reasonable predictions for exotic hadrons, doubly charmed, and doubly bottomed baryons. The results are also comparable to Gaussian Process and Constituent Quark Model.
    Generating Synthetic Clinical Data that Capture Class Imbalanced Distributions with Generative Adversarial Networks: Example using Antiretroviral Therapy for HIV. (arXiv:2208.08655v2 [cs.LG] UPDATED)
    Clinical data usually cannot be freely distributed due to their highly confidential nature and this hampers the development of machine learning in the healthcare domain. One way to mitigate this problem is by generating realistic synthetic datasets using generative adversarial networks (GANs). However, GANs are known to suffer from mode collapse thus creating outputs of low diversity. This lowers the quality of the synthetic healthcare data, and may cause it to omit patients of minority demographics or neglect less common clinical practices. In this paper, we extend the classic GAN setup with an additional variational autoencoder (VAE) and include an external memory to replay latent features observed from the real samples to the GAN generator. Using antiretroviral therapy for human immunodeficiency virus (ART for HIV) as a case study, we show that our extended setup overcomes mode collapse and generates a synthetic dataset that accurately describes severely imbalanced class distributions commonly found in real-world clinical variables. In addition, we demonstrate that our synthetic dataset is associated with a very low patient disclosure risk, and that it retains a high level of utility from the ground truth dataset to support the development of downstream machine learning algorithms.
    Revisiting consistency for semi-supervised semantic segmentation. (arXiv:2106.07075v5 [cs.CV] UPDATED)
    Semi-supervised learning an attractive technique in practical deployments of deep models since it relaxes the dependence on labeled data. It is especially important in the scope of dense prediction because pixel-level annotation requires significant effort. This paper considers semi-supervised algorithms that enforce consistent predictions over perturbed unlabeled inputs. We study the advantages of perturbing only one of the two model instances and preventing the backward pass through the unperturbed instance. We also propose a competitive perturbation model as a composition of geometric warp and photometric jittering. We experiment with efficient models due to their importance for real-time and low-power applications. Our experiments show clear advantages of (1) one-way consistency, (2) perturbing only the student branch, and (3) strong photometric and geometric perturbations. Our perturbation model outperforms recent work and most of the contribution comes from photometric component. Experiments with additional data from the large coarsely annotated subset of Cityscapes suggest that semi-supervised training can outperform supervised training with the coarse labels.
    Multimodal Frame-Scoring Transformer for Video Summarization. (arXiv:2207.01814v3 [cs.LG] UPDATED)
    As the number of video content has mushroomed in recent years, automatic video summarization has come useful when we want to just peek at the content of the video. However, there are two underlying limitations in generic video summarization task. First, most previous approaches read in just visual features as input, leaving other modality features behind. Second, existing datasets for generic video summarization are relatively insufficient to train a caption generator used for extracting text information from a video and to train the multimodal feature extractors. To address these two problems, this paper proposes the Multimodal Frame-Scoring Transformer (MFST), a framework exploiting visual, text, and audio features and scoring a video with respect to frames. Our MFST framework first extracts each modality features (audio-visual-text) using pretrained encoders. Then, MFST trains the multimodal frame-scoring transformer that uses multimodal representation based on extracted features as inputs and predicts frame-level scores. Our extensive experiments with previous models and ablation studies on TVSum and SumMe datasets demonstrate the effectiveness and superiority of our proposed method by a large margin in both F1 score and Rank-based evaluation.
    Online Estimation of Network Point Processes for Event Streams. (arXiv:2009.01742v2 [cs.SI] UPDATED)
    A common goal in network modeling is to uncover the latent community structure present among nodes. For many real-world networks, the true connections consist of events arriving as streams, which are then aggregated to form edges, ignoring the dynamic temporal component. A natural way to take account of these temporal dynamics of interactions is to use point processes as the foundation of network models for community detection. Computational complexity hampers the scalability of such approaches to large sparse networks. To circumvent this challenge, we propose a fast online variational inference algorithm for estimating the latent structure underlying dynamic event arrivals on a network, using continuous-time point process latent network models. We describe this procedure for networks models capturing community structure. This structure can be learned as new events are observed on the network, updating the inferred community assignments. We investigate the theoretical properties of such an inference scheme, and provide regret bounds on the loss function of this procedure. The proposed inference procedure is then thoroughly compared, using both simulation studies and real data, to non-online variants. We demonstrate that online inference can obtain comparable performance, in terms of community recovery, to non-online variants, while realising computational gains. Our proposed inference framework can also be readily modified to incorporate other popular network structures.
    LaF: Labeling-Free Model Selection for Automated Deep Neural Network Reusing. (arXiv:2204.03994v2 [cs.LG] UPDATED)
    Applying deep learning to science is a new trend in recent years which leads DL engineering to become an important problem. Although training data preparation, model architecture design, and model training are the normal processes to build DL models, all of them are complex and costly. Therefore, reusing the open-sourced pre-trained model is a practical way to bypass this hurdle for developers. Given a specific task, developers can collect massive pre-trained deep neural networks from public sources for re-using. However, testing the performance (e.g., accuracy and robustness) of multiple DNNs and recommending which model should be used is challenging regarding the scarcity of labeled data and the demand for domain expertise. In this paper, we propose a labeling-free (LaF) model selection approach to overcome the limitations of labeling efforts for automated model reusing. The main idea is to statistically learn a Bayesian model to infer the models' specialty only based on predicted labels. We evaluate LaF using 9 benchmark datasets including image, text, and source code, and 165 DNNs, considering both the accuracy and robustness of models. The experimental results demonstrate that LaF outperforms the baseline methods by up to 0.74 and 0.53 on Spearman's correlation and Kendall's $\tau$, respectively.
    CHAPTER: Exploiting Convolutional Neural Network Adapters for Self-supervised Speech Models. (arXiv:2212.01282v2 [eess.AS] UPDATED)
    Self-supervised learning (SSL) is a powerful technique for learning representations from unlabeled data. Transformer based models such as HuBERT, which consist a feature extractor and transformer layers, are leading the field in the speech domain. SSL models are fine-tuned on a wide range of downstream tasks, which involves re-training the majority of the model for each task. Previous studies have introduced applying adapters, which are small lightweight modules commonly used in Natural Language Processing (NLP) to adapt pre-trained models to new tasks. However, such efficient tuning techniques only provide adaptation at the transformer layer, but failed to perform adaptation at the feature extractor. In this paper, we propose CHAPTER, an efficient tuning method specifically designed for SSL speech model, by applying CNN adapters at the feature extractor. Using this method, we can only fine-tune fewer than 5% of parameters per task compared to fully fine-tuning and achieve better and more stable performance. We empirically found that adding CNN adapters to the feature extractor can help the adaptation on emotion and speaker tasks. For instance, the accuracy of SID is improved from 87.71 to 91.56, and the accuracy of ER is improved by 5%.
    StratDef: Strategic Defense Against Adversarial Attacks in ML-based Malware Detection. (arXiv:2202.07568v4 [cs.LG] UPDATED)
    Over the years, most research towards defenses against adversarial attacks on machine learning models has been in the image recognition domain. The malware detection domain has received less attention despite its importance. Moreover, most work exploring these defenses has focused on several methods but with no strategy when applying them. In this paper, we introduce StratDef, which is a strategic defense system based on a moving target defense approach. We overcome challenges related to the systematic construction, selection, and strategic use of models to maximize adversarial robustness. StratDef dynamically and strategically chooses the best models to increase the uncertainty for the attacker while minimizing critical aspects in the adversarial ML domain, like attack transferability. We provide the first comprehensive evaluation of defenses against adversarial attacks on machine learning for malware detection, where our threat model explores different levels of threat, attacker knowledge, capabilities, and attack intensities. We show that StratDef performs better than other defenses even when facing the peak adversarial threat. We also show that, of the existing defenses, only a few adversarially-trained models provide substantially better protection than just using vanilla models but are still outperformed by StratDef.
    Enforcing the consensus between Trajectory Optimization and Policy Learning for precise robot control. (arXiv:2209.09006v2 [cs.RO] UPDATED)
    Reinforcement learning (RL) and trajectory optimization (TO) present strong complementary advantages. On one hand, RL approaches are able to learn global control policies directly from data, but generally require large sample sizes to properly converge towards feasible policies. On the other hand, TO methods are able to exploit gradient-based information extracted from simulators to quickly converge towards a locally optimal control trajectory which is only valid within the vicinity of the solution. Over the past decade, several approaches have aimed to adequately combine the two classes of methods in order to obtain the best of both worlds. Following on from this line of research, we propose several improvements on top of these approaches to learn global control policies quicker, notably by leveraging sensitivity information stemming from TO methods via Sobolev learning, and augmented Lagrangian techniques to enforce the consensus between TO and policy learning. We evaluate the benefits of these improvements on various classical tasks in robotics through comparison with existing approaches in the literature.
    Self-Play and Self-Describe: Policy Adaptation with Vision-Language Foundation Models. (arXiv:2212.07398v2 [cs.LG] UPDATED)
    Recent progress on vision-language foundation models have brought significant advancement to building general-purpose robots. By using the pre-trained models to encode the scene and instructions as inputs for decision making, the instruction-conditioned policy can generalize across different objects and tasks. While this is encouraging, the policy still fails in most cases given an unseen task or environment. To adapt the policy to unseen tasks and environments, we explore a new paradigm on leveraging the pre-trained foundation models with Self-PLAY and Self-Describe (SPLAYD). When deploying the trained policy to a new task or a new environment, we first let the policy self-play with randomly generated instructions to record the demonstrations. While the execution could be wrong, we can use the pre-trained foundation models to accurately self-describe (i.e., re-label or classify) the demonstrations. This automatically provides new pairs of demonstration-instruction data for policy fine-tuning. We evaluate our method on a broad range of experiments with the focus on generalization on unseen objects, unseen tasks, unseen environments, and sim-to-real transfer. We show SPLAYD improves baselines by a large margin in all cases. Our project page is available at https://geyuying.github.io/SPLAYD/
    Mixed-Integer Optimization with Constraint Learning. (arXiv:2111.04469v2 [math.OC] UPDATED)
    We establish a broad methodological foundation for mixed-integer optimization with learned constraints. We propose an end-to-end pipeline for data-driven decision making in which constraints and objectives are directly learned from data using machine learning, and the trained models are embedded in an optimization formulation. We exploit the mixed-integer optimization-representability of many machine learning methods, including linear models, decision trees, ensembles, and multi-layer perceptrons, which allows us to capture various underlying relationships between decisions, contextual variables, and outcomes. We also introduce two approaches for handling the inherent uncertainty of learning from data. First, we characterize a decision trust region using the convex hull of the observations, to ensure credible recommendations and avoid extrapolation. We efficiently incorporate this representation using column generation and propose a more flexible formulation to deal with low-density regions and high-dimensional datasets. Then, we propose an ensemble learning approach that enforces constraint satisfaction over multiple bootstrapped estimators or multiple algorithms. In combination with domain-driven components, the embedded models and trust region define a mixed-integer optimization problem for prescription generation. We implement this framework as a Python package (OptiCL) for practitioners. We demonstrate the method in both World Food Programme planning and chemotherapy optimization. The case studies illustrate the framework's ability to generate high-quality prescriptions as well as the value added by the trust region, the use of ensembles to control model robustness, the consideration of multiple machine learning methods, and the inclusion of multiple learned constraints.
    Offline Policy Evaluation with Out-of-Sample Guarantees. (arXiv:2301.08649v1 [stat.ML])
    We consider the problem of evaluating the performance of a decision policy using past observational data. The outcome of a policy is measured in terms of a loss or disutility (or negative reward) and the problem is to draw valid inferences about the out-of-sample loss of the specified policy when the past data is observed under a, possibly unknown, policy. Using a sample-splitting method, we show that it is possible to draw such inferences with finite-sample coverage guarantees that evaluate the entire loss distribution. Importantly, the method takes into account model misspecifications of the past policy -- including unmeasured confounding. The evaluation method can be used to certify the performance of a policy using observational data under an explicitly specified range of credible model assumptions.
    Evaluating the Evaluators: Which UDA validation methods are most effective? Can they be improved?. (arXiv:2208.07360v2 [cs.CV] UPDATED)
    This paper compares and ranks 8 UDA validation methods. Validators estimate model accuracy, which makes them an essential component of any UDA train-test pipeline. We rank these validators to indicate which of them are most useful for the purpose of selecting optimal model checkpoints and hyperparameters. To the best of our knowledge, this large-scale benchmark study is the first of its kind in the UDA field. In addition, we propose three new validators that outperform all the existing checkpoint-based validators that we were able to find in the existing literature. Code is available at https://www.github.com/KevinMusgrave/powerful-benchmarker.
    DIAMOND: Taming Sample and Communication Complexities in Decentralized Bilevel Optimization. (arXiv:2212.02376v4 [cs.LG] UPDATED)
    Decentralized bilevel optimization has received increasing attention recently due to its foundational role in many emerging multi-agent learning paradigms (e.g., multi-agent meta-learning and multi-agent reinforcement learning) over peer-to-peer edge networks. However, to work with the limited computation and communication capabilities of edge networks, a major challenge in developing decentralized bilevel optimization techniques is to lower sample and communication complexities. This motivates us to develop a new decentralized bilevel optimization called DIAMOND (decentralized single-timescale stochastic approximation with momentum and gradient-tracking). The contributions of this paper are as follows: i) our DIAMOND algorithm adopts a single-loop structure rather than following the natural double-loop structure of bilevel optimization, which offers low computation and implementation complexity; ii) compared to existing approaches, the DIAMOND algorithm does not require any full gradient evaluations, which further reduces both sample and computational complexities; iii) through a careful integration of momentum information and gradient tracking techniques, we show that the DIAMOND algorithm enjoys $\mathcal{O}(\epsilon^{-3/2})$ in sample and communication complexities for achieving an $\epsilon$-stationary solution, both of which are independent of the dataset sizes and significantly outperform existing works. Extensive experiments also verify our theoretical findings.
    AccDecoder: Accelerated Decoding for Neural-enhanced Video Analytics. (arXiv:2301.08664v1 [cs.CV])
    The quality of the video stream is key to neural network-based video analytics. However, low-quality video is inevitably collected by existing surveillance systems because of poor quality cameras or over-compressed/pruned video streaming protocols, e.g., as a result of upstream bandwidth limit. To address this issue, existing studies use quality enhancers (e.g., neural super-resolution) to improve the quality of videos (e.g., resolution) and eventually ensure inference accuracy. Nevertheless, directly applying quality enhancers does not work in practice because it will introduce unacceptable latency. In this paper, we present AccDecoder, a novel accelerated decoder for real-time and neural-enhanced video analytics. AccDecoder can select a few frames adaptively via Deep Reinforcement Learning (DRL) to enhance the quality by neural super-resolution and then up-scale the unselected frames that reference them, which leads to 6-21% accuracy improvement. AccDecoder provides efficient inference capability via filtering important frames using DRL for DNN-based inference and reusing the results for the other frames via extracting the reference relationship among frames and blocks, which results in a latency reduction of 20-80% than baselines.
    Interpretable bilinear attention network with domain adaptation improves drug-target prediction. (arXiv:2208.02194v2 [cs.LG] UPDATED)
    Predicting drug-target interaction is key for drug discovery. Recent deep learning-based methods show promising performance but two challenges remain: (i) how to explicitly model and learn local interactions between drugs and targets for better prediction and interpretation; (ii) how to generalize prediction performance on novel drug-target pairs from different distribution. In this work, we propose DrugBAN, a deep bilinear attention network (BAN) framework with domain adaptation to explicitly learn pair-wise local interactions between drugs and targets, and adapt on out-of-distribution data. DrugBAN works on drug molecular graphs and target protein sequences to perform prediction, with conditional domain adversarial learning to align learned interaction representations across different distributions for better generalization on novel drug-target pairs. Experiments on three benchmark datasets under both in-domain and cross-domain settings show that DrugBAN achieves the best overall performance against five state-of-the-art baselines. Moreover, visualizing the learned bilinear attention map provides interpretable insights from prediction results.
    Online Decision Making for Trading Wind Energy. (arXiv:2209.02009v2 [cs.LG] UPDATED)
    This paper proposes and develops a new algorithm for trading wind energy in electricity markets, within an online learning and optimization framework. In particular, we combine a component-wise adaptive variant of the gradient descent algorithm with recent advances in the feature-driven newsvendor model. This results in an online offering approach capable of leveraging data-rich environments, while adapting to non-stationary characteristics of energy generation and electricity markets, and with a minimal computational burden. The performance of our approach is analyzed based on several numerical experiments, showing both better adaptability to non-stationary uncertain parameters and significant economic gains.
    High Dimensional Statistical Estimation under Uniformly Dithered One-bit Quantization. (arXiv:2202.13157v4 [stat.ML] UPDATED)
    In this paper, we propose a uniformly dithered 1-bit quantization scheme for high-dimensional statistical estimation. The scheme contains truncation, dithering, and quantization as typical steps. As canonical examples, the quantization scheme is applied to the estimation problems of sparse covariance matrix estimation, sparse linear regression (i.e., compressed sensing), and matrix completion. We study both sub-Gaussian and heavy-tailed regimes, where the underlying distribution of heavy-tailed data is assumed to have bounded moments of some order. We propose new estimators based on 1-bit quantized data. In sub-Gaussian regime, our estimators achieve near minimax rates, indicating that our quantization scheme costs very little. In heavy-tailed regime, while the rates of our estimators become essentially slower, these results are either the first ones in an 1-bit quantized and heavy-tailed setting, or already improve on existing comparable results from some respect. Under the observations in our setting, the rates are almost tight in compressed sensing and matrix completion. Our 1-bit compressed sensing results feature general sensing vector that is sub-Gaussian or even heavy-tailed. We also first investigate a novel setting where both the covariate and response are quantized. In addition, our approach to 1-bit matrix completion does not rely on likelihood and represent the first method robust to pre-quantization noise with unknown distribution. Experimental results on synthetic data are presented to support our theoretical analysis.
    On Image Segmentation With Noisy Labels: Characterization and Volume Properties of the Optimal Solutions to Accuracy and Dice. (arXiv:2206.06484v3 [cs.CV] UPDATED)
    We study two of the most popular performance metrics in medical image segmentation, Accuracy and Dice, when the target labels are noisy. For both metrics, several statements related to characterization and volume properties of the set of optimal segmentations are proved, and associated experiments are provided. Our main insights are: (i) the volume of the solutions to both metrics may deviate significantly from the expected volume of the target, (ii) the volume of a solution to Accuracy is always less than or equal to the volume of a solution to Dice and (iii) the optimal solutions to both of these metrics coincide when the set of feasible segmentations is constrained to the set of segmentations with the volume equal to the expected volume of the target.
    Learning from non-irreducible Markov chains. (arXiv:2110.04338v2 [math.ST] UPDATED)
    Mostof the existing literature on supervised machine learning problems focuses on the case when the training data set is drawn from an i.i.d. sample. However, many practical problems are characterized by temporal dependence and strong correlation between the marginals of the data-generating process, suggesting that the i.i.d. assumption is not always justified. This problem has been already considered in the context of Markov chains satisfying the Doeblin condition. This condition, among other things, implies that the chain is not singular in its behavior, i.e. it is irreducible. In this article, we focus on the case when the training data set is drawn from a not necessarily irreducible Markov chain. Under the assumption that the chain is uniformly ergodic with respect to the $\mathrm{L}^1$-Wasserstein distance, and certain regularity assumptions on the hypothesis class and the state space of the chain, we first obtain a uniform convergence result for the corresponding sample error, and then we conclude learnability of the approximate sample error minimization algorithm and find its generalization bounds. At the end, a relative uniform convergence result for the sample error is also discussed.
    Asynchronous, Option-Based Multi-Agent Policy Gradient: A Conditional Reasoning Approach. (arXiv:2203.15925v2 [cs.RO] UPDATED)
    Multi-agent policy gradient methods have demonstrated success in games and robotics but are often limited to problems with low-level action space. However, when agents take higher-level, temporally-extended actions (i.e. options), when and how to derive a centralized control policy, its gradient as well as sampling options for all agents while not interrupting current option executions, becomes a challenge. This is mostly because agents may choose and terminate their options \textit{asynchronously}. In this work, we propose a conditional reasoning approach to address this problem, and empirically validate its effectiveness on representative option-based multi-agent cooperative tasks.
    Coupled Physics-informed Neural Networks for Inferring Solutions of Partial Differential Equations with Unknown Source Terms. (arXiv:2301.08618v1 [cs.LG])
    Physics-informed neural networks (PINNs) provide a transformative development for approximating the solutions to partial differential equations (PDEs). This work proposes a coupled physics-informed neural network (C-PINN) for the nonhomogeneous PDEs with unknown dynamical source terms, which is used to describe the systems with external forces and cannot be well approximated by the existing PINNs. In our method, two neural networks, NetU and NetG, are proposed. NetU is constructed to generate a quasi-solution satisfying PDEs under study. NetG is used to regularize the training of NetU. Then, the two networks are integrated into a data-physics-hybrid cost function. Finally, we propose a hierarchical training strategy to optimize and couple the two networks. The performance of C-PINN is proved by approximating several classical PDEs.
    STORM-GAN: Spatio-Temporal Meta-GAN for Cross-City Estimation of Human Mobility Responses to COVID-19. (arXiv:2301.08648v1 [cs.LG])
    Human mobility estimation is crucial during the COVID-19 pandemic due to its significant guidance for policymakers to make non-pharmaceutical interventions. While deep learning approaches outperform conventional estimation techniques on tasks with abundant training data, the continuously evolving pandemic poses a significant challenge to solving this problem due to data nonstationarity, limited observations, and complex social contexts. Prior works on mobility estimation either focus on a single city or lack the ability to model the spatio-temporal dependencies across cities and time periods. To address these issues, we make the first attempt to tackle the cross-city human mobility estimation problem through a deep meta-generative framework. We propose a Spatio-Temporal Meta-Generative Adversarial Network (STORM-GAN) model that estimates dynamic human mobility responses under a set of social and policy conditions related to COVID-19. Facilitated by a novel spatio-temporal task-based graph (STTG) embedding, STORM-GAN is capable of learning shared knowledge from a spatio-temporal distribution of estimation tasks and quickly adapting to new cities and time periods with limited training samples. The STTG embedding component is designed to capture the similarities among cities to mitigate cross-task heterogeneity. Experimental results on real-world data show that the proposed approach can greatly improve estimation performance and out-perform baselines.
    Learning Sequential Latent Variable Models from Multimodal Time Series Data. (arXiv:2204.10419v2 [cs.LG] UPDATED)
    Sequential modelling of high-dimensional data is an important problem that appears in many domains including model-based reinforcement learning and dynamics identification for control. Latent variable models applied to sequential data (i.e., latent dynamics models) have been shown to be a particularly effective probabilistic approach to solve this problem, especially when dealing with images. However, in many application areas (e.g., robotics), information from multiple sensing modalities is available -- existing latent dynamics methods have not yet been extended to effectively make use of such multimodal sequential data. Multimodal sensor streams can be correlated in a useful manner and often contain complementary information across modalities. In this work, we present a self-supervised generative modelling framework to jointly learn a probabilistic latent state representation of multimodal data and the respective dynamics. Using synthetic and real-world datasets from a multimodal robotic planar pushing task, we demonstrate that our approach leads to significant improvements in prediction and representation quality. Furthermore, we compare to the common learning baseline of concatenating each modality in the latent space and show that our principled probabilistic formulation performs better. Finally, despite being fully self-supervised, we demonstrate that our method is nearly as effective as an existing supervised approach that relies on ground truth labels.
    Source-free Subject Adaptation for EEG-based Visual Recognition. (arXiv:2301.08448v1 [eess.SP])
    This paper focuses on subject adaptation for EEG-based visual recognition. It aims at building a visual stimuli recognition system customized for the target subject whose EEG samples are limited, by transferring knowledge from abundant data of source subjects. Existing approaches consider the scenario that samples of source subjects are accessible during training. However, it is often infeasible and problematic to access personal biological data like EEG signals due to privacy issues. In this paper, we introduce a novel and practical problem setup, namely source-free subject adaptation, where the source subject data are unavailable and only the pre-trained model parameters are provided for subject adaptation. To tackle this challenging problem, we propose classifier-based data generation to simulate EEG samples from source subjects using classifier responses. Using the generated samples and target subject data, we perform subject-independent feature learning to exploit the common knowledge shared across different subjects. Notably, our framework is generalizable and can adopt any subject-independent learning method. In the experiments on the EEG-ImageNet40 benchmark, our model brings consistent improvements regardless of the choice of subject-independent learning. Also, our method shows promising performance, recording top-1 test accuracy of 74.6% under the 5-shot setting even without relying on source data. Our code can be found at https://github.com/DeepBCI/Deep-BCI/tree/master/1_Intelligent_BCI/Source_Free_Subject_Adaptation_for_EEG.
    SamBaS: Sampling-Based Stochastic Block Partitioning. (arXiv:2108.06651v2 [cs.SI] UPDATED)
    Community detection is a well-studied problem with applications in domains ranging from networking to bioinformatics. Due to the rapid growth in the volume of real-world data, there is growing interest in accelerating contemporary community detection algorithms. However, the more accurate and statistically robust methods tend to be hard to parallelize. One such method is stochastic block partitioning (SBP) - a community detection algorithm that works well on graphs with complex and heterogeneous community structure. In this paper, we present a sampling-based SBP (SamBaS) for accelerating SBP on sparse graphs. We characterize how various graph parameters affect the speedup and result quality of community detection with SamBaS and quantify the trade-offs therein. To evaluate SamBas on real-world web graphs without known ground-truth communities, we introduce partition quality score (PQS), an evaluation metric that outperforms modularity in terms of correlation with F1 score. Overall, SamBaS achieves speedups of up to 10X while maintaining result quality (and even improving result quality by over 150% on certain graphs, relative to F1 score).
    Plan To Predict: Learning an Uncertainty-Foreseeing Model for Model-Based Reinforcement Learning. (arXiv:2301.08502v1 [cs.LG])
    In Model-based Reinforcement Learning (MBRL), model learning is critical since an inaccurate model can bias policy learning via generating misleading samples. However, learning an accurate model can be difficult since the policy is continually updated and the induced distribution over visited states used for model learning shifts accordingly. Prior methods alleviate this issue by quantifying the uncertainty of model-generated samples. However, these methods only quantify the uncertainty passively after the samples were generated, rather than foreseeing the uncertainty before model trajectories fall into those highly uncertain regions. The resulting low-quality samples can induce unstable learning targets and hinder the optimization of the policy. Moreover, while being learned to minimize one-step prediction errors, the model is generally used to predict for multiple steps, leading to a mismatch between the objectives of model learning and model usage. To this end, we propose \emph{Plan To Predict} (P2P), an MBRL framework that treats the model rollout process as a sequential decision making problem by reversely considering the model as a decision maker and the current policy as the dynamics. In this way, the model can quickly adapt to the current policy and foresee the multi-step future uncertainty when generating trajectories. Theoretically, we show that the performance of P2P can be guaranteed by approximately optimizing a lower bound of the true environment return. Empirical results demonstrate that P2P achieves state-of-the-art performance on several challenging benchmark tasks.
    Revisiting Estimation Bias in Policy Gradients for Deep Reinforcement Learning. (arXiv:2301.08442v1 [cs.LG])
    We revisit the estimation bias in policy gradients for the discounted episodic Markov decision process (MDP) from Deep Reinforcement Learning (DRL) perspective. The objective is formulated theoretically as the expected returns discounted over the time horizon. One of the major policy gradient biases is the state distribution shift: the state distribution used to estimate the gradients differs from the theoretical formulation in that it does not take into account the discount factor. Existing discussion of the influence of this bias was limited to the tabular and softmax cases in the literature. Therefore, in this paper, we extend it to the DRL setting where the policy is parameterized and demonstrate how this bias can lead to suboptimal policies theoretically. We then discuss why the empirically inaccurate implementations with shifted state distribution can still be effective. We show that, despite such state distribution shift, the policy gradient estimation bias can be reduced in the following three ways: 1) a small learning rate; 2) an adaptive-learning-rate-based optimizer; and 3) KL regularization. Specifically, we show that a smaller learning rate, or, an adaptive learning rate, such as that used by Adam and RSMProp optimizers, makes the policy optimization robust to the bias. We further draw connections between optimizers and the optimization regularization to show that both the KL and the reverse KL regularization can significantly rectify this bias. Moreover, we provide extensive experiments on continuous control tasks to support our analysis. Our paper sheds light on how successful PG algorithms optimize policies in the DRL setting, and contributes insights into the practical issues in DRL.
    Promises and pitfalls of deep neural networks in neuroimaging-based psychiatric research. (arXiv:2301.08525v1 [cs.LG])
    By promising more accurate diagnostics and individual treatment recommendations, deep neural networks and in particular convolutional neural networks have advanced to a powerful tool in medical imaging. Here, we first give an introduction into methodological key concepts and resulting methodological promises including representation and transfer learning, as well as modelling domain-specific priors. After reviewing recent applications within neuroimaging-based psychiatric research, such as the diagnosis of psychiatric diseases, delineation of disease subtypes, normative modeling, and the development of neuroimaging biomarkers, we discuss current challenges. This includes for example the difficulty of training models on small, heterogeneous and biased data sets, the lack of validity of clinical labels, algorithmic bias, and the influence of confounding variables.
    Optimal Convergence Rates of Deep Convolutional Neural Networks: Additive Ridge Functions. (arXiv:2202.12119v2 [cs.LG] UPDATED)
    Convolutional neural networks have shown impressive abilities in many applications, especially those related to the classification tasks. However, for the regression problem, the abilities of convolutional structures have not been fully understood, and further investigation is needed. In this paper, we consider the mean squared error analysis for deep convolutional neural networks. We show that, for additive ridge functions, convolutional neural networks followed by one fully connected layer with ReLU activation functions can reach optimal mini-max rates (up to a log factor). The input dimension only appears in the constant of convergence rates. This work shows the statistical optimality of convolutional neural networks and may shed light on why convolutional neural networks are able to behave well for high dimensional input.
    Predicting Surface Texture in Steel Manufacturing at Speed. (arXiv:2301.08527v1 [cs.LG])
    Control of the surface texture of steel strip during the galvanizing and temper rolling processes is essential to satisfy customer requirements and is conventionally measured post-production using a stylus. In-production laser reflection measurement is less consistent than physical measurement but enables real time adjustment of processing parameters to optimize product surface characteristics. We propose the use of machine learning to improve accuracy of the transformation from inline laser reflection measurements to a prediction of surface properties. In addition to accuracy, model evaluation speed is important for fast feedback control. The ROCKET model is one of the fastest state of the art models, however it can be sped up by utilizing a GPU. Our contribution is to implement the model in PyTorch for fast GPU kernel transforms and provide a soft version of the Proportion of Positive Values (PPV) nonlinear pooling function, allowing gradient flow. We perform timing and performance experiments comparing the implementations
    NeRF in the Palm of Your Hand: Corrective Augmentation for Robotics via Novel-View Synthesis. (arXiv:2301.08556v1 [cs.LG])
    Expert demonstrations are a rich source of supervision for training visual robotic manipulation policies, but imitation learning methods often require either a large number of demonstrations or expensive online expert supervision to learn reactive closed-loop behaviors. In this work, we introduce SPARTN (Synthetic Perturbations for Augmenting Robot Trajectories via NeRF): a fully-offline data augmentation scheme for improving robot policies that use eye-in-hand cameras. Our approach leverages neural radiance fields (NeRFs) to synthetically inject corrective noise into visual demonstrations, using NeRFs to generate perturbed viewpoints while simultaneously calculating the corrective actions. This requires no additional expert supervision or environment interaction, and distills the geometric information in NeRFs into a real-time reactive RGB-only policy. In a simulated 6-DoF visual grasping benchmark, SPARTN improves success rates by 2.8$\times$ over imitation learning without the corrective augmentations and even outperforms some methods that use online supervision. It additionally closes the gap between RGB-only and RGB-D success rates, eliminating the previous need for depth sensors. In real-world 6-DoF robotic grasping experiments from limited human demonstrations, our method improves absolute success rates by $22.5\%$ on average, including objects that are traditionally challenging for depth-based methods. See video results at \url{https://bland.website/spartn}.
    Spectral embedding of weighted graphs. (arXiv:1910.05534v4 [stat.ML] UPDATED)
    When analyzing weighted networks using spectral embedding, a judicious transformation of the edge weights may produce better results. To formalize this idea, we consider the asymptotic behavior of spectral embedding for different edge-weight representations, under a generic low rank model. We measure the quality of different embeddings -- which can be on entirely different scales -- by how easy it is to distinguish communities, in an information-theoretic sense. For common types of weighted graphs, such as count networks or p-value networks, we find that transformations such as tempering or thresholding can be highly beneficial, both in theory and in practice.
    ILLUME: Rationalizing Vision-Language Models through Human Interactions. (arXiv:2208.08241v3 [cs.LG] UPDATED)
    Bootstrapping from pre-trained language models has been proven to be an efficient approach for building vision-language models (VLM) for tasks such as image captioning or visual question answering. However, outputs of these models rarely align with user's rationales for specific answers. In order to improve this alignment and reinforce commonsense reasons, we propose a tuning paradigm based on human interactions with machine generated data. Our ILLUME executes the following loop: Given an image-question-answer prompt, the VLM samples multiple candidate rationales, and a human critic provides minimal feedback via preference selection, used for fine-tuning. This loop increases the training data and gradually carves out the VLM's rationalization capabilities that are aligned with human intend. Our exhaustive experiments demonstrate that ILLUME is competitive with standard supervised fine-tuning while using significantly fewer training data and only requiring minimal feedback.
    Tight bounds for maximum $\ell_1$-margin classifiers. (arXiv:2212.03783v2 [stat.ML] UPDATED)
    Popular iterative algorithms such as boosting methods and coordinate descent on linear models converge to the maximum $\ell_1$-margin classifier, a.k.a. sparse hard-margin SVM, in high dimensional regimes where the data is linearly separable. Previous works consistently show that many estimators relying on the $\ell_1$-norm achieve improved statistical rates for hard sparse ground truths. We show that surprisingly, this adaptivity does not apply to the maximum $\ell_1$-margin classifier for a standard discriminative setting. In particular, for the noiseless setting, we prove tight upper and lower bounds for the prediction error that match existing rates of order $\frac{\|w^*\|_1^{2/3}}{n^{1/3}}$ for general ground truths. To complete the picture, we show that when interpolating noisy observations, the error vanishes at a rate of order $\frac{1}{\sqrt{\log(d/n)}}$. We are therefore first to show benign overfitting for the maximum $\ell_1$-margin classifier.
    Introducing Expertise Logic into Graph Representation Learning from A Causal Perspective. (arXiv:2301.08496v1 [cs.LG])
    Benefiting from the injection of human prior knowledge, graphs, as derived discrete data, are semantically dense so that models can efficiently learn the semantic information from such data. Accordingly, graph neural networks (GNNs) indeed achieve impressive success in various fields. Revisiting the GNN learning paradigms, we discover that the relationship between human expertise and the knowledge modeled by GNNs still confuses researchers. To this end, we introduce motivating experiments and derive an empirical observation that the human expertise is gradually learned by the GNNs in general domains. By further observing the ramifications of introducing expertise logic into graph representation learning, we conclude that leading the GNNs to learn human expertise can improve the model performance. By exploring the intrinsic mechanism behind such observations, we elaborate the Structural Causal Model for the graph representation learning paradigm. Following the theoretical guidance, we innovatively introduce the auxiliary causal logic learning paradigm to improve the model to learn the expertise logic causally related to the graph representation learning task. In practice, the counterfactual technique is further performed to tackle the insufficient training issue during optimization. Plentiful experiments on the crafted and real-world domains support the consistent effectiveness of the proposed method.
    Interpretability Study on Deep Learning for Jet Physics at the Large Hadron Collider. (arXiv:1911.01872v1 [hep-ph] CROSS LISTED)
    Using deep neural networks for identifying physics objects at the Large Hadron Collider (LHC) has become a powerful alternative approach in recent years. After successful training of deep neural networks, examining the trained networks not only helps us understand the behaviour of neural networks, but also helps improve the performance of deep learning models through proper interpretation. We take jet tagging problem at the LHC as an example, using recursive neural networks as a starting point, aim at a thorough understanding of the behaviour of the physics-oriented DNNs and the information encoded in the embedding space. We make a comparative study on a series of different jet tagging tasks dominated by different underlying physics. Interesting observations on the latent space are obtained.
    Who Should I Engage with At What Time? A Missing Event Aware Temporal Graph Neural Network. (arXiv:2301.08399v1 [cs.LG])
    Temporal graph neural network has recently received significant attention due to its wide application scenarios, such as bioinformatics, knowledge graphs, and social networks. There are some temporal graph neural networks that achieve remarkable results. However, these works focus on future event prediction and are performed under the assumption that all historical events are observable. In real-world applications, events are not always observable, and estimating event time is as important as predicting future events. In this paper, we propose MTGN, a missing event-aware temporal graph neural network, which uniformly models evolving graph structure and timing of events to support predicting what will happen in the future and when it will happen.MTGN models the dynamic of both observed and missing events as two coupled temporal point processes, thereby incorporating the effects of missing events into the network. Experimental results on several real-world temporal graphs demonstrate that MTGN significantly outperforms existing methods with up to 89% and 112% more accurate time and link prediction. Code can be found on https://github.com/HIT-ICES/TNNLS-MTGN.  ( 2 min )
    Pneumonia Detection in Chest X-Ray Images : Handling Class Imbalance. (arXiv:2301.08479v1 [eess.IV])
    People all over the globe are affected by pneumonia but deaths due to it are highest in Sub-Saharan Asia and South Asia. In recent years, the overall incidence and mortality rate of pneumonia regardless of the utilization of effective vaccines and compelling antibiotics has escalated. Thus, pneumonia remains a disease that needs spry prevention and treatment. The widespread prevalence of pneumonia has caused the research community to come up with a framework that helps detect, diagnose and analyze diseases accurately and promptly. One of the major hurdles faced by the Artificial Intelligence (AI) research community is the lack of publicly available datasets for chest diseases, including pneumonia . Secondly, few of the available datasets are highly imbalanced (normal examples are over sampled, while samples with ailment are in severe minority) making the problem even more challenging. In this article we present a novel framework for the detection of pneumonia. The novelty of the proposed methodology lies in the tackling of class imbalance problem. The Generative Adversarial Network (GAN), specifically a combination of Deep Convolutional Generative Adversarial Network (DCGAN) and Wasserstein GAN gradient penalty (WGAN-GP) was applied on the minority class ``Pneumonia'' for augmentation, whereas Random Under-Sampling (RUS) was done on the majority class ``No Findings'' to deal with the imbalance problem. The ChestX-Ray8 dataset, one of the biggest datasets, is used to validate the performance of the proposed framework. The learning phase is completed using transfer learning on state-of-the-art deep learning models i.e. ResNet-50, Xception, and VGG-16. Results obtained exceed state-of-the-art.  ( 2 min )
    Feature Relevance Analysis to Explain Concept Drift -- A Case Study in Human Activity Recognition. (arXiv:2301.08453v1 [cs.LG])
    This article studies how to detect and explain concept drift. Human activity recognition is used as a case study together with a online batch learning situation where the quality of the labels used in the model updating process starts to decrease. Drift detection is based on identifying a set of features having the largest relevance difference between the drifting model and a model that is known to be accurate and monitoring how the relevance of these features changes over time. As a main result of this article, it is shown that feature relevance analysis cannot only be used to detect the concept drift but also to explain the reason for the drift when a limited number of typical reasons for the concept drift are predefined. To explain the reason for the concept drift, it is studied how these predefined reasons effect to feature relevance. In fact, it is shown that each of these has an unique effect to features relevance and these can be used to explain the reason for concept drift.  ( 2 min )
    Self-Supervised Learning for Data Scarcity in a Fatigue Damage Prognostic Problem. (arXiv:2301.08441v1 [stat.ML])
    With the increasing availability of data for Prognostics and Health Management (PHM), Deep Learning (DL) techniques are now the subject of considerable attention for this application, often achieving more accurate Remaining Useful Life (RUL) predictions. However, one of the major challenges for DL techniques resides in the difficulty of obtaining large amounts of labelled data on industrial systems. To overcome this lack of labelled data, an emerging learning technique is considered in our work: Self-Supervised Learning, a sub-category of unsupervised learning approaches. This paper aims to investigate whether pre-training DL models in a self-supervised way on unlabelled sensors data can be useful for RUL estimation with only Few-Shots Learning, i.e. with scarce labelled data. In this research, a fatigue damage prognostics problem is addressed, through the estimation of the RUL of aluminum alloy panels (typical of aerospace structures) subject to fatigue cracks from strain gauge data. Synthetic datasets composed of strain data are used allowing to extensively investigate the influence of the dataset size on the predictive performance. Results show that the self-supervised pre-trained models are able to significantly outperform the non-pre-trained models in downstream RUL prediction task, and with less computational expense, showing promising results in prognostic tasks when only limited labelled data is available.  ( 2 min )
    Optimization of body configuration and joint-driven attitude stabilization for transformable spacecrafts under solar radiation pressure. (arXiv:2301.08435v1 [cs.LG])
    A solar sail is one of the most promising space exploration system because of its theoretically infinite specific impulse using solar radiation pressure (SRP). Recently, some researchers proposed "transformable spacecrafts" that can actively reconfigure their body configurations with actuatable joints. The transformable spacecrafts are expected to greatly enhance orbit and attitude control capability due to its high redundancy in control degree of freedom if they are used as solar sails. However, its large number of input poses difficulties in control, and therefore, previous researchers imposed strong constraints to limit its potential control capabilities. This paper addresses novel attitude control techniques for the transformable spacecrafts under SRP. The authors have constructed two proposed methods; one of those is a joint angle optimization to acquire arbitrary SRP force and torque, and the other is a momentum damping control driven by joint angle actuation. Our proposed methods are formulated in general forms and applicable to any transformable solar sail that consists of flat and thin body components. Validity of the proposed methods are confirmed by numerical simulations. This paper contributes to making most of the high control redundancy of transformable solar sails without consuming any expendable propellants, which is expected to greatly enhance orbit and attitude control capability.  ( 2 min )
    Visual Writing Prompts: Character-Grounded Story Generation with Curated Image Sequences. (arXiv:2301.08571v1 [cs.CL])
    Current work on image-based story generation suffers from the fact that the existing image sequence collections do not have coherent plots behind them. We improve visual story generation by producing a new image-grounded dataset, Visual Writing Prompts (VWP). VWP contains almost 2K selected sequences of movie shots, each including 5-10 images. The image sequences are aligned with a total of 12K stories which were collected via crowdsourcing given the image sequences and a set of grounded characters from the corresponding image sequence. Our new image sequence collection and filtering process has allowed us to obtain stories that are more coherent and have more narrativity compared to previous work. We also propose a character-based story generation model driven by coherence as a strong baseline. Evaluations show that our generated stories are more coherent, visually grounded, and have more narrativity than stories generated with the current state-of-the-art model.
    Generative Slate Recommendation with Reinforcement Learning. (arXiv:2301.08632v1 [cs.IR])
    Recent research has employed reinforcement learning (RL) algorithms to optimize long-term user engagement in recommender systems, thereby avoiding common pitfalls such as user boredom and filter bubbles. They capture the sequential and interactive nature of recommendations, and thus offer a principled way to deal with long-term rewards and avoid myopic behaviors. However, RL approaches are intractable in the slate recommendation scenario - where a list of items is recommended at each interaction turn - due to the combinatorial action space. In that setting, an action corresponds to a slate that may contain any combination of items. While previous work has proposed well-chosen decompositions of actions so as to ensure tractability, these rely on restrictive and sometimes unrealistic assumptions. Instead, in this work we propose to encode slates in a continuous, low-dimensional latent space learned by a variational auto-encoder. Then, the RL agent selects continuous actions in this latent space, which are ultimately decoded into the corresponding slates. By doing so, we are able to (i) relax assumptions required by previous work, and (ii) improve the quality of the action selection by modeling full slates instead of independent items, in particular by enabling diversity. Our experiments performed on a wide array of simulated environments confirm the effectiveness of our generative modeling of slates over baselines in practical scenarios where the restrictive assumptions underlying the baselines are lifted. Our findings suggest that representation learning using generative models is a promising direction towards generalizable RL-based slate recommendation.
    Autonomous Drug Design with Multi-Armed Bandits. (arXiv:2207.01393v2 [cs.LG] UPDATED)
    Recent developments in artificial intelligence and automation support a new drug design paradigm: autonomous drug design. Under this paradigm, generative models can provide suggestions on thousands of molecules with specific properties, and automated laboratories can potentially make, test and analyze molecules with minimal human supervision. However, since still only a limited number of molecules can be synthesized and tested, an obvious challenge is how to efficiently select among provided suggestions in a closed-loop system. We formulate this task as a stochastic multi-armed bandit problem with multiple plays, volatile arms and similarity information. To solve this task, we adapt previous work on multi-armed bandits to this setting, and compare our solution with random sampling, greedy selection and decaying-epsilon-greedy selection strategies. According to our simulation results, our approach has the potential to perform better exploration and exploitation of the chemical space for autonomous drug design.
    Ontology Pre-training for Poison Prediction. (arXiv:2301.08577v1 [cs.AI])
    Integrating human knowledge into neural networks has the potential to improve their robustness and interpretability. We have developed a novel approach to integrate knowledge from ontologies into the structure of a Transformer network which we call ontology pre-training: we train the network to predict membership in ontology classes as a way to embed the structure of the ontology into the network, and subsequently fine-tune the network for the particular prediction task. We apply this approach to a case study in predicting the potential toxicity of a small molecule based on its molecular structure, a challenging task for machine learning in life sciences chemistry. Our approach improves on the state of the art, and moreover has several additional benefits. First, we are able to show that the model learns to focus attention on more meaningful chemical groups when making predictions with ontology pre-training than without, paving a path towards greater robustness and interpretability. Second, the training time is reduced after ontology pre-training, indicating that the model is better placed to learn what matters for toxicity prediction with the ontology pre-training than without. This strategy has general applicability as a neuro-symbolic approach to embed meaningful semantics into neural networks.
    Baechi: Fast Device Placement of Machine Learning Graphs. (arXiv:2301.08695v1 [cs.DC])
    Machine Learning graphs (or models) can be challenging or impossible to train when either devices have limited memory, or models are large. To split the model across devices, learning-based approaches are still popular. While these result in model placements that train fast on data (i.e., low step times), learning-based model-parallelism is time-consuming, taking many hours or days to create a placement plan of operators on devices. We present the Baechi system, the first to adopt an algorithmic approach to the placement problem for running machine learning training graphs on small clusters of memory-constrained devices. We integrate our implementation of Baechi into two popular open-source learning frameworks: TensorFlow and PyTorch. Our experimental results using GPUs show that: (i) Baechi generates placement plans 654 X - 206K X faster than state-of-the-art learning-based approaches, and (ii) Baechi-placed model's step (training) time is comparable to expert placements in PyTorch, and only up to 6.2% worse than expert placements in TensorFlow. We prove mathematically that our two algorithms are within a constant factor of the optimal. Our work shows that compared to learning-based approaches, algorithmic approaches can face different challenges for adaptation to Machine learning systems, but also they offer proven bounds, and significant performance benefits.
    Neural Architecture Search: Insights from 1000 Papers. (arXiv:2301.08727v1 [cs.LG])
    In the past decade, advances in deep learning have resulted in breakthroughs in a variety of areas, including computer vision, natural language understanding, speech recognition, and reinforcement learning. Specialized, high-performing neural architectures are crucial to the success of deep learning in these areas. Neural architecture search (NAS), the process of automating the design of neural architectures for a given task, is an inevitable next step in automating machine learning and has already outpaced the best human-designed architectures on many tasks. In the past few years, research in NAS has been progressing rapidly, with over 1000 papers released since 2020. In this survey, we provide an organized and comprehensive guide to neural architecture search. We give a taxonomy of search spaces, algorithms, and speedup techniques, and we discuss resources such as benchmarks, best practices, other surveys, and open-source libraries.
    Explainable prediction of Qcodes for NOTAMs using column generation. (arXiv:2208.04955v2 [cs.LG] UPDATED)
    A NOtice To AirMen (NOTAM) contains important flight route related information. To search and filter them, NOTAMs are grouped into categories called QCodes. In this paper, we develop a tool to predict, with some explanations, a Qcode for a NOTAM. We present a way to extend the interpretable binary classification using column generation proposed in Dash, Gunluk, and Wei (2018) to a multiclass text classification method. We describe the techniques used to tackle the issues related to one vs-rest classification, such as multiple outputs and class imbalances. Furthermore, we introduce some heuristics, including the use of a CP-SAT solver for the subproblems, to reduce the training time. Finally, we show that our approach compares favorably with state-of-the-art machine learning algorithms like Linear SVM and small neural networks while adding the needed interpretability component.
    FewSOME: Few Shot Anomaly Detection. (arXiv:2301.06957v2 [cs.LG] UPDATED)
    Recent years have seen considerable progress in the field of Anomaly Detection but at the cost of increasingly complex training pipelines. Such techniques require large amounts of training data, resulting in computationally expensive algorithms. We propose Few Shot anomaly detection (FewSOME), a deep One-Class Anomaly Detection algorithm with the ability to accurately detect anomalies having trained on 'few' examples of the normal class and no examples of the anomalous class. We describe FewSOME to be of low complexity given its low data requirement and short training time. FewSOME is aided by pretrained weights with an architecture based on Siamese Networks. By means of an ablation study, we demonstrate how our proposed loss, 'Stop Loss', improves the robustness of FewSOME. Our experiments demonstrate that FewSOME performs at state-of-the-art level on benchmark datasets MNIST, CIFAR-10, F-MNIST and MVTec AD while training on only 30 normal samples, a minute fraction of the data that existing methods are trained on. Most notably, we found that FewSOME outperforms even highly complex models in the setting where only few examples of the normal class exist. Moreover, our extensive experiments show FewSOME to be robust to contaminated datasets. We also report F1 score and Balanced Accuracy in addition to AUC as a benchmark for future techniques to be compared against.
    Automated extraction of capacitive coupling for quantum dot systems. (arXiv:2301.08654v1 [cond-mat.mes-hall])
    Gate-defined quantum dots (QDs) have appealing attributes as a quantum computing platform, however, near-term devices possess a range of possible imperfections that need to be accounted for during the tuning and operation of QD devices. One such problem is the capacitive cross-talk between the metallic gates that define and control QD qubits. A way to compensate for the capacitive cross-talk and enable targeted control of specific QDs independent of coupling is by the use of virtual gates. Here, we demonstrate a reliable automated capacitive coupling identification method that combines machine learning with traditional fitting to take advantage of the desirable properties of each. We also show how the cross-capacitance measurement may be used for the identification of spurious QDs sometimes formed during tuning experimental devices. Our systems can autonomously flag devices with spurious dots near the operating regime which is crucial information for reliable tuning to a regime suitable for qubit operations.
    Accelerating Multi-Agent Planning Using Graph Transformers with Bounded Suboptimality. (arXiv:2301.08451v1 [cs.AI])
    Conflict-Based Search is one of the most popular methods for multi-agent path finding. Though it is complete and optimal, it does not scale well. Recent works have been proposed to accelerate it by introducing various heuristics. However, whether these heuristics can apply to non-grid-based problem settings while maintaining their effectiveness remains an open question. In this work, we find that the answer is prone to be no. To this end, we propose a learning-based component, i.e., the Graph Transformer, as a heuristic function to accelerate the planning. The proposed method is provably complete and bounded-suboptimal with any desired factor. We conduct extensive experiments on two environments with dense graphs. Results show that the proposed Graph Transformer can be trained in problem instances with relatively few agents and generalizes well to a larger number of agents, while achieving better performance than state-of-the-art methods.  ( 2 min )
    Asynchronously Trained Distributed Topographic Maps. (arXiv:2301.08379v1 [cs.LG])
    Topographic feature maps are low dimensional representations of data, that preserve spatial dependencies. Current methods of training such maps (e.g. self organizing maps - SOM, generative topographic maps) require centralized control and synchronous execution, which restricts scalability. We present an algorithm that uses $N$ autonomous units to generate a feature map by distributed asynchronous training. Unit autonomy is achieved by sparse interaction in time \& space through the combination of a distributed heuristic search, and a cascade-driven weight updating scheme governed by two rules: a unit i) adapts when it receives either a sample, or the weight vector of a neighbor, and ii) broadcasts its weight vector to its neighbors after adapting for a predefined number of times. Thus, a vector update can trigger an avalanche of adaptation. We map avalanching to a statistical mechanics model, which allows us to parametrize the statistical properties of cascading. Using MNIST, we empirically investigate the effect of the heuristic search accuracy and the cascade parameters on map quality. We also provide empirical evidence that algorithm complexity scales at most linearly with system size $N$. The proposed approach is found to perform comparably with similar methods in classification tasks across multiple datasets.  ( 2 min )
    Within-group fairness: A guidance for more sound between-group fairness. (arXiv:2301.08375v1 [stat.ML])
    As they have a vital effect on social decision-making, AI algorithms not only should be accurate and but also should not pose unfairness against certain sensitive groups (e.g., non-white, women). Various specially designed AI algorithms to ensure trained AI models to be fair between sensitive groups have been developed. In this paper, we raise a new issue that between-group fair AI models could treat individuals in a same sensitive group unfairly. We introduce a new concept of fairness so-called within-group fairness which requires that AI models should be fair for those in a same sensitive group as well as those in different sensitive groups. We materialize the concept of within-group fairness by proposing corresponding mathematical definitions and developing learning algorithms to control within-group fairness and between-group fairness simultaneously. Numerical studies show that the proposed learning algorithms improve within-group fairness without sacrificing accuracy as well as between-group fairness.  ( 2 min )
    Sequence Generation via Subsequence Similarity: Theory and Application to UAV Identification. (arXiv:2301.08403v1 [cs.LG])
    The ability to generate synthetic sequences is crucial for a wide range of applications, and recent advances in deep learning architectures and generative frameworks have greatly facilitated this process. Particularly, unconditional one-shot generative models constitute an attractive line of research that focuses on capturing the internal information of a single image, video, etc. to generate samples with similar contents. Since many of those one-shot models are shifting toward efficient non-deep and non-adversarial approaches, we examine the versatility of a one-shot generative model for augmenting whole datasets. In this work, we focus on how similarity at the subsequence level affects similarity at the sequence level, and derive bounds on the optimal transport of real and generated sequences based on that of corresponding subsequences. We use a one-shot generative model to sample from the vicinity of individual sequences and generate subsequence-similar ones and demonstrate the improvement of this approach by applying it to the problem of Unmanned Aerial Vehicle (UAV) identification using limited radio-frequency (RF) signals. In the context of UAV identification, RF fingerprinting is an effective method for distinguishing legitimate devices from malicious ones, but heterogenous environments and channel impairments can impose data scarcity and affect the performance of classification models. By using subsequence similarity to augment sequences of RF data with a low ratio (5\%-20\%) of training dataset, we achieve significant improvements in performance metrics such as accuracy, precision, recall, and F1 score.  ( 2 min )
    Which Features are Learned by CodeBert: An Empirical Study of the BERT-based Source Code Representation Learning. (arXiv:2301.08427v1 [cs.CL])
    The Bidirectional Encoder Representations from Transformers (BERT) were proposed in the natural language process (NLP) and shows promising results. Recently researchers applied the BERT to source-code representation learning and reported some good news on several downstream tasks. However, in this paper, we illustrated that current methods cannot effectively understand the logic of source codes. The representation of source code heavily relies on the programmer-defined variable and function names. We design and implement a set of experiments to demonstrate our conjecture and provide some insights for future works.  ( 2 min )
    Clustering Human Mobility with Multiple Spaces. (arXiv:2301.08524v1 [cs.LG])
    Human mobility clustering is an important problem for understanding human mobility behaviors (e.g., work and school commutes). Existing methods typically contain two steps: choosing or learning a mobility representation and applying a clustering algorithm to the representation. However, these methods rely on strict visiting orders in trajectories and cannot take advantage of multiple types of mobility representations. This paper proposes a novel mobility clustering method for mobility behavior detection. First, the proposed method contains a permutation-equivalent operation to handle sub-trajectories that might have different visiting orders but similar impacts on mobility behaviors. Second, the proposed method utilizes a variational autoencoder architecture to simultaneously perform clustering in both latent and original spaces. Also, in order to handle the bias of a single latent space, our clustering assignment prediction considers multiple learned latent spaces at different epochs. This way, the proposed method produces accurate results and can provide reliability estimates of each trajectory's cluster assignment. The experiment shows that the proposed method outperformed state-of-the-art methods in mobility behavior detection from trajectories with better accuracy and more interpretability.
    Regular Time-series Generation using SGM. (arXiv:2301.08518v1 [cs.LG])
    Score-based generative models (SGMs) are generative models that are in the spotlight these days. Time-series frequently occurs in our daily life, e.g., stock data, climate data, and so on. Especially, time-series forecasting and classification are popular research topics in the field of machine learning. SGMs are also known for outperforming other generative models. As a result, we apply SGMs to synthesize time-series data by learning conditional score functions. We propose a conditional score network for the time-series generation domain. Furthermore, we also derive the loss function between the score matching and the denoising score matching in the time-series generation domain. Finally, we achieve state-of-the-art results on real-world datasets in terms of sampling diversity and quality.
    Estimation of Large Financial Covariances: A Cross-Validation Approach. (arXiv:2012.05757v2 [stat.ML] UPDATED)
    We introduce a novel covariance estimator for portfolio selection that adapts to the non-stationary or persistent heteroskedastic environments of financial time series by employing exponentially weighted averages and nonlinearly shrinking the sample eigenvalues through cross-validation. Our estimator is structure agnostic, transparent, and computationally feasible in large dimensions. By correcting the biases in the sample eigenvalues and aligning our estimator to more recent risk, we demonstrate that our estimator performs well in large dimensions against existing state-of-the-art static and dynamic covariance shrinkage estimators through simulations and with an empirical application in active portfolio management.
    Optimizing model-agnostic Random Subspace ensembles. (arXiv:2109.03099v3 [cs.LG] UPDATED)
    This paper presents a model-agnostic ensemble approach for supervised learning. The proposed approach is based on a parametric version of Random Subspace, in which each base model is learned from a feature subset sampled according to a Bernoulli distribution. Parameter optimization is performed using gradient descent and is rendered tractable by using an importance sampling approach that circumvents frequent re-training of the base models after each gradient descent step. The degree of randomization in our parametric Random Subspace is thus automatically tuned through the optimization of the feature selection probabilities. This is an advantage over the standard Random Subspace approach, where the degree of randomization is controlled by a hyper-parameter. Furthermore, the optimized feature selection probabilities can be interpreted as feature importance scores. Our algorithm can also easily incorporate any differentiable regularization term to impose constraints on these importance scores.
    The Lost Art of Mathematical Modelling. (arXiv:2301.08559v1 [q-bio.OT])
    We provide a critique of mathematical biology in light of rapid developments in modern machine learning. We argue that out of the three modelling activities -- (1) formulating models; (2) analysing models; and (3) fitting or comparing models to data -- inherent to mathematical biology, researchers currently focus too much on activity (2) at the cost of (1). This trend, we propose, can be reversed by realising that any given biological phenomena can be modelled in an infinite number of different ways, through the adoption of an open/pluralistic approach. We explain the open approach using fish locomotion as a case study and illustrate some of the pitfalls -- universalism, creating models of models, etc. -- that hinder mathematical biology. We then ask how we might rediscover a lost art: that of creative mathematical modelling. This article is dedicated to the memory of Edmund Crampin.
    Fair Credit Scorer through Bayesian Approach. (arXiv:2301.08412v1 [cs.LG])
    Machine learning currently plays an increasingly important role in people's lives in areas such as credit scoring, auto-driving, disease diagnosing, and insurance quoting. However, in many of these areas, machine learning models have performed unfair behaviors against some sub-populations, such as some particular groups of race, sex, and age. These unfair behaviors can be on account of the pre-existing bias in the training dataset due to historical and social factors. In this paper, we focus on a real-world application of credit scoring and construct a fair prediction model by introducing latent variables to remove the correlation between protected attributes, such as sex and age, with the observable feature inputs, including house and job. For detailed implementation, we apply Bayesian approaches, including the Markov Chain Monte Carlo simulation, to estimate our proposed fair model.
    Modeling Moral Choices in Social Dilemmas with Multi-Agent Reinforcement Learning. (arXiv:2301.08491v1 [cs.MA])
    Practical uses of Artificial Intelligence (AI) in the real world have demonstrated the importance of embedding moral choices into intelligent agents. They have also highlighted that defining top-down ethical constraints on AI according to any one type of morality is extremely challenging and can pose risks. A bottom-up learning approach may be more appropriate for studying and developing ethical behavior in AI agents. In particular, we believe that an interesting and insightful starting point is the analysis of emergent behavior of Reinforcement Learning (RL) agents that act according to a predefined set of moral rewards in social dilemmas. In this work, we present a systematic analysis of the choices made by intrinsically-motivated RL agents whose rewards are based on moral theories. We aim to design reward structures that are simplified yet representative of a set of key ethical systems. Therefore, we first define moral reward functions that distinguish between consequence- and norm-based agents, between morality based on societal norms or internal virtues, and between single- and mixed-virtue (e.g., multi-objective) methodologies. Then, we evaluate our approach by modeling repeated dyadic interactions between learning moral agents in three iterated social dilemma games (Prisoner's Dilemma, Volunteer's Dilemma and Stag Hunt). We analyze the impact of different types of morality on the emergence of cooperation, defection or exploitation, and the corresponding social outcomes. Finally, we discuss the implications of these findings for the development of moral agents in artificial and mixed human-AI societies.
    Open-Set Likelihood Maximization for Few-Shot Learning. (arXiv:2301.08390v1 [cs.CV])
    We tackle the Few-Shot Open-Set Recognition (FSOSR) problem, i.e. classifying instances among a set of classes for which we only have a few labeled samples, while simultaneously detecting instances that do not belong to any known class. We explore the popular transductive setting, which leverages the unlabelled query instances at inference. Motivated by the observation that existing transductive methods perform poorly in open-set scenarios, we propose a generalization of the maximum likelihood principle, in which latent scores down-weighing the influence of potential outliers are introduced alongside the usual parametric model. Our formulation embeds supervision constraints from the support set and additional penalties discouraging overconfident predictions on the query set. We proceed with a block-coordinate descent, with the latent scores and parametric model co-optimized alternately, thereby benefiting from each other. We call our resulting formulation \textit{Open-Set Likelihood Optimization} (OSLO). OSLO is interpretable and fully modular; it can be applied on top of any pre-trained model seamlessly. Through extensive experiments, we show that our method surpasses existing inductive and transductive methods on both aspects of open-set recognition, namely inlier classification and outlier detection.  ( 2 min )
    Quantum HyperNetworks: Training Binary Neural Networks in Quantum Superposition. (arXiv:2301.08292v1 [quant-ph])
    Binary neural networks, i.e., neural networks whose parameters and activations are constrained to only two possible values, offer a compelling avenue for the deployment of deep learning models on energy- and memory-limited devices. However, their training, architectural design, and hyperparameter tuning remain challenging as these involve multiple computationally expensive combinatorial optimization problems. Here we introduce quantum hypernetworks as a mechanism to train binary neural networks on quantum computers, which unify the search over parameters, hyperparameters, and architectures in a single optimization loop. Through classical simulations, we demonstrate that of our approach effectively finds optimal parameters, hyperparameters and architectural choices with high probability on classification problems including a two-dimensional Gaussian dataset and a scaled-down version of the MNIST handwritten digits. We represent our quantum hypernetworks as variational quantum circuits, and find that an optimal circuit depth maximizes the probability of finding performant binary neural networks. Our unified approach provides an immense scope for other applications in the field of machine learning.  ( 2 min )
    AdaEnsemble: Learning Adaptively Sparse Structured Ensemble Network for Click-Through Rate Prediction. (arXiv:2301.08353v1 [cs.IR])
    Learning feature interactions is crucial to success for large-scale CTR prediction in recommender systems and Ads ranking. Researchers and practitioners extensively proposed various neural network architectures for searching and modeling feature interactions. However, we observe that different datasets favor different neural network architectures and feature interaction types, suggesting that different feature interaction learning methods may have their own unique advantages. Inspired by this observation, we propose AdaEnsemble: a Sparsely-Gated Mixture-of-Experts (SparseMoE) architecture that can leverage the strengths of heterogeneous feature interaction experts and adaptively learns the routing to a sparse combination of experts for each example, allowing us to build a dynamic hierarchy of the feature interactions of different types and orders. To further improve the prediction accuracy and inference efficiency, we incorporate the dynamic early exiting mechanism for feature interaction depth selection. The AdaEnsemble can adaptively choose the feature interaction depth and find the corresponding SparseMoE stacking layer to exit and compute prediction from. Therefore, our proposed architecture inherits the advantages of the exponential combinations of sparsely gated experts within SparseMoE layers and further dynamically selects the optimal feature interaction depth without executing deeper layers. We implement the proposed AdaEnsemble and evaluate its performance on real-world datasets. Extensive experiment results demonstrate the efficiency and effectiveness of AdaEnsemble over state-of-the-art models.  ( 2 min )
    An Efficient Quadrature Sequence and Sparsifying Methodology for Mean-Field Variational Inference. (arXiv:2301.08374v1 [cs.LG])
    This work proposes a quasirandom sequence of quadratures for high-dimensional mean-field variational inference and a related sparsifying methodology. Each iterate of the sequence contains two evaluations points that combine to correctly integrate all univariate quadratic functions, as well as univariate cubics if the mean-field factors are symmetric. More importantly, averaging results over short subsequences achieves periodic exactness on a much larger space of multivariate polynomials of quadratic total degree. This framework is devised by first considering stochastic blocked mean-field quadratures, which may be useful in other contexts. By replacing pseudorandom sequences with quasirandom sequences, over half of all multivariate quadratic basis functions integrate exactly with only 4 function evaluations, and the exactness dimension increases for longer subsequences. Analysis shows how these efficient integrals characterize the dominant log-posterior contributions to mean-field variational approximations, including diagonal Hessian approximations, to support a robust sparsifying methodology in deep learning algorithms. A numerical demonstration of this approach on a simple Convolutional Neural Network for MNIST retains high test accuracy, 96.9%, while training over 98.9% of parameters to zero in only 10 epochs, bearing potential to reduce both storage and energy requirements for deep learning models.  ( 2 min )
    Deep Reinforcement Learning for Power Trading. (arXiv:2301.08360v1 [q-fin.TR])
    The Dutch power market includes a day-ahead market and an auction-like intraday balancing market. The varying supply and demand of power and its uncertainty induces an imbalance, which causes differing power prices in these two markets and creates an opportunity for arbitrage. In this paper, we present collaborative dual-agent reinforcement learning (RL) for bi-level simulation and optimization of European power arbitrage trading. Moreover, we propose two novel practical implementations specifically addressing the electricity power market. Leveraging the concept of imitation learning, the RL agent's reward is reformed by taking into account prior domain knowledge results in better convergence during training and, moreover, improves and generalizes performance. In addition, tranching of orders improves the bidding success rate and significantly raises the P&L. We show that each method contributes significantly to the overall performance uplifting, and the integrated methodology achieves about three-fold improvement in cumulative P&L over the original agent, as well as outperforms the highest benchmark policy by around 50% while exhibits efficient computational performance.  ( 2 min )
    Evaluation of the potential of Near Infrared Hyperspectral Imaging for monitoring the invasive brown marmorated stink bug. (arXiv:2301.08252v1 [eess.IV])
    The brown marmorated stink bug (BMSB), Halyomorpha halys, is an invasive insect pest of global importance that damages several crops, compromising agri-food production. Field monitoring procedures are fundamental to perform risk assessment operations, in order to promptly face crop infestations and avoid economical losses. To improve pest management, spectral cameras mounted on Unmanned Aerial Vehicles (UAVs) and other Internet of Things (IoT) devices, such as smart traps or unmanned ground vehicles, could be used as an innovative technology allowing fast, efficient and real-time monitoring of insect infestations. The present study consists in a preliminary evaluation at the laboratory level of Near Infrared Hyperspectral Imaging (NIR-HSI) as a possible technology to detect BMSB specimens on different vegetal backgrounds, overcoming the problem of BMSB mimicry. Hyperspectral images of BMSB were acquired in the 980-1660 nm range, considering different vegetal backgrounds selected to mimic a real field application scene. Classification models were obtained following two different chemometric approaches. The first approach was focused on modelling spectral information and selecting relevant spectral regions for discrimination by means of sparse-based variable selection coupled with Soft Partial Least Squares Discriminant Analysis (s-Soft PLS-DA) classification algorithm. The second approach was based on modelling spatial and spectral features contained in the hyperspectral images using Convolutional Neural Networks (CNN). Finally, to further improve BMSB detection ability, the two strategies were merged, considering only the spectral regions selected by s-Soft PLS-DA for CNN modelling.  ( 2 min )
    Causal conditional hidden Markov model for multimodal traffic prediction. (arXiv:2301.08249v1 [cs.LG])
    Multimodal traffic flow can reflect the health of the transportation system, and its prediction is crucial to urban traffic management. Recent works overemphasize spatio-temporal correlations of traffic flow, ignoring the physical concepts that lead to the generation of observations and their causal relationship. Spatio-temporal correlations are considered unstable under the influence of different conditions, and spurious correlations may exist in observations. In this paper, we analyze the physical concepts affecting the generation of multimode traffic flow from the perspective of the observation generation principle and propose a Causal Conditional Hidden Markov Model (CCHMM) to predict multimodal traffic flow. In the latent variables inference stage, a posterior network disentangles the causal representations of the concepts of interest from conditional information and observations, and a causal propagation module mines their causal relationship. In the data generation stage, a prior network samples the causal latent variables from the prior distribution and feeds them into the generator to generate multimodal traffic flow. We use a mutually supervised training method for the prior and posterior to enhance the identifiability of the model. Experiments on real-world datasets show that CCHMM can effectively disentangle causal representations of concepts of interest and identify causality, and accurately predict multimodal traffic flow.  ( 2 min )
    Advanced Scaling Methods for VNF deployment with Reinforcement Learning. (arXiv:2301.08325v1 [cs.NI])
    Network function virtualization (NFV) and software-defined network (SDN) have become emerging network paradigms, allowing virtualized network function (VNF) deployment at a low cost. Even though VNF deployment can be flexible, it is still challenging to optimize VNF deployment due to its high complexity. Several studies have approached the task as dynamic programming, e.g., integer linear programming (ILP). However, optimizing VNF deployment for highly complex networks remains a challenge. Alternatively, reinforcement learning (RL) based approaches have been proposed to optimize this task, especially to employ a scaling action-based method which can deploy VNFs within less computational time. However, the model architecture can be improved further to generalize to the different networking settings. In this paper, we propose an enhanced model which can be adapted to more general network settings. We adopt the improved GNN architecture and a few techniques to obtain a better node representation for the VNF deployment task. Furthermore, we apply a recently proposed RL method, phasic policy gradient (PPG), to leverage the shared representation of the service function chain (SFC) generation model from the value function. We evaluate the proposed method in various scenarios, achieving a better QoS with minimum resource utilization compared to the previous methods. Finally, as a qualitative evaluation, we analyze our proposed encoder's representation for the nodes, which shows a more disentangled representation.  ( 2 min )
    Forecasting subcritical cylinder wakes with Fourier Neural Operators. (arXiv:2301.08290v1 [physics.flu-dyn])
    We apply Fourier neural operators (FNOs), a state-of-the-art operator learning technique, to forecast the temporal evolution of experimentally measured velocity fields. FNOs are a recently developed machine learning method capable of approximating solution operators to systems of partial differential equations through data alone. The learned FNO solution operator can be evaluated in milliseconds, potentially enabling faster-than-real-time modeling for predictive flow control in physical systems. Here we use FNOs to predict how physical fluid flows evolve in time, training with particle image velocimetry measurements depicting cylinder wakes in the subcritical vortex shedding regime. We train separate FNOs at Reynolds numbers ranging from Re = 240 to Re = 3060 and study how increasingly turbulent flow phenomena impact prediction accuracy. We focus here on a short prediction horizon of ten non-dimensionalized time-steps, as would be relevant for problems of predictive flow control. We find that FNOs are capable of accurately predicting the evolution of experimental velocity fields throughout the range of Reynolds numbers tested (L2 norm error < 0.1) despite being provided with limited and imperfect flow observations. Given these results, we conclude that this method holds significant potential for real-time predictive flow control of physical systems.  ( 2 min )
    Investigating the Impact of Direct Punishment on the Emergence of Cooperation in Multi-Agent Reinforcement Learning Systems. (arXiv:2301.08278v1 [cs.MA])
    The problem of cooperation is of fundamental importance for human societies, with examples ranging from navigating road junctions to negotiating climate treaties. As the use of AI becomes more pervasive within society, the need for socially intelligent agents that are able to navigate these complex dilemmas is becoming increasingly evident. Direct punishment is an ubiquitous social mechanism that has been shown to benefit the emergence of cooperation within the natural world, however no prior work has investigated its impact on populations of learning agents. Moreover, although the use of all forms of punishment in the natural world is strongly coupled with partner selection and reputation, no existing work has provided a holistic analysis of their combination within multi-agent systems. In this paper, we present a comprehensive analysis of the behaviors and learning dynamics associated with direct punishment in multi-agent reinforcement learning systems and how this compares to third-party punishment, when both forms of punishment are combined with other social mechanisms such as partner selection and reputation. We provide an extensive and systematic evaluation of the impact of these key mechanisms on the emergence of cooperation. Finally, we discuss the implications of the use of these mechanisms in the design of cooperative AI systems.  ( 2 min )
  • Open

    Online Estimation of Network Point Processes for Event Streams. (arXiv:2009.01742v2 [cs.SI] UPDATED)
    A common goal in network modeling is to uncover the latent community structure present among nodes. For many real-world networks, the true connections consist of events arriving as streams, which are then aggregated to form edges, ignoring the dynamic temporal component. A natural way to take account of these temporal dynamics of interactions is to use point processes as the foundation of network models for community detection. Computational complexity hampers the scalability of such approaches to large sparse networks. To circumvent this challenge, we propose a fast online variational inference algorithm for estimating the latent structure underlying dynamic event arrivals on a network, using continuous-time point process latent network models. We describe this procedure for networks models capturing community structure. This structure can be learned as new events are observed on the network, updating the inferred community assignments. We investigate the theoretical properties of such an inference scheme, and provide regret bounds on the loss function of this procedure. The proposed inference procedure is then thoroughly compared, using both simulation studies and real data, to non-online variants. We demonstrate that online inference can obtain comparable performance, in terms of community recovery, to non-online variants, while realising computational gains. Our proposed inference framework can also be readily modified to incorporate other popular network structures.  ( 2 min )
    Neural Architecture Search: Insights from 1000 Papers. (arXiv:2301.08727v1 [cs.LG])
    In the past decade, advances in deep learning have resulted in breakthroughs in a variety of areas, including computer vision, natural language understanding, speech recognition, and reinforcement learning. Specialized, high-performing neural architectures are crucial to the success of deep learning in these areas. Neural architecture search (NAS), the process of automating the design of neural architectures for a given task, is an inevitable next step in automating machine learning and has already outpaced the best human-designed architectures on many tasks. In the past few years, research in NAS has been progressing rapidly, with over 1000 papers released since 2020. In this survey, we provide an organized and comprehensive guide to neural architecture search. We give a taxonomy of search spaces, algorithms, and speedup techniques, and we discuss resources such as benchmarks, best practices, other surveys, and open-source libraries.
    Bayesian Spatial Predictive Synthesis. (arXiv:2203.05197v3 [stat.ME] UPDATED)
    Spatial data are characterized by their spatial dependence, which is often complex, non-linear, and difficult to capture with a single model. Significant levels of model uncertainty -- arising from these characteristics -- cannot be resolved by model selection or simple ensemble methods. We address this issue by proposing a novel methodology that captures spatially varying model uncertainty, which we call Bayesian spatial predictive synthesis. Our proposal is derived by identifying the theoretically best approximate model under reasonable conditions, which is a latent factor spatially varying coefficient model in the Bayesian predictive synthesis framework. We then show that our proposed method produces exact minimax predictive distributions, providing finite sample guarantees. Two MCMC strategies are implemented for full uncertainty quantification, as well as a variational inference strategy for fast point inference. We also extend the estimation strategy for general responses. Through simulation examples and two real data applications, we demonstrate that our proposed spatial Bayesian predictive synthesis outperforms standard spatial models and advanced machine learning methods in terms of predictive accuracy.
    Multi armed bandits and quantum channel oracles. (arXiv:2301.08544v1 [quant-ph])
    Multi armed bandits are one of the theoretical pillars of reinforcement learning. Recently, the investigation of quantum algorithms for multi armed bandit problems was started, and it was found that a quadratic speed-up is possible when the arms and the randomness of the rewards of the arms can be queried in superposition. Here we introduce further bandit models where we only have limited access to the randomness of the rewards, but we can still query the arms in superposition. We show that this impedes any speed-up of quantum algorithms.
    AI-assisted neutron spectroscopy using active learning with log-Gaussian processes. (arXiv:2209.00980v2 [physics.data-an] UPDATED)
    To understand the origins of materials properties, neutron scattering experiments at three-axes spectrometers (TAS) investigate magnetic and lattice excitations in a sample by measuring intensity distributions in its momentum (Q) and energy (E) space. The high demand and limited availability of beam time for TAS experiments however raise the natural question whether we can improve their efficiency or make better use of the experimenter's time. In fact, using TAS, there are a number of scientific questions that require searching for signals of interest in a particular region of Q-E space, but when done manually, it is time consuming and inefficient since the measurement points may be placed in uninformative regions such as the background. Active learning is a promising general machine learning approach that allows to iteratively detect informative regions of signal autonomously, i.e., without human interference, thus avoiding unnecessary measurements and speeding up the experiment. In addition, the autonomous mode allows experimenters to focus on other relevant tasks in the meantime. The approach that we describe in this article exploits log-Gaussian processes which, due to the logarithmic transformation, have the largest approximation uncertainties in regions of signal. Maximizing uncertainty as an acquisition function hence directly yields locations for informative measurements. We demonstrate the benefits of our approach on outcomes of a real neutron experiment at the thermal TAS EIGER (PSI) as well as on results of a benchmark in a synthetic setting including numerous different excitations.
    Offline Policy Evaluation with Out-of-Sample Guarantees. (arXiv:2301.08649v1 [stat.ML])
    We consider the problem of evaluating the performance of a decision policy using past observational data. The outcome of a policy is measured in terms of a loss or disutility (or negative reward) and the problem is to draw valid inferences about the out-of-sample loss of the specified policy when the past data is observed under a, possibly unknown, policy. Using a sample-splitting method, we show that it is possible to draw such inferences with finite-sample coverage guarantees that evaluate the entire loss distribution. Importantly, the method takes into account model misspecifications of the past policy -- including unmeasured confounding. The evaluation method can be used to certify the performance of a policy using observational data under an explicitly specified range of credible model assumptions.
    Intrinsic persistent homology via density-based metric learning. (arXiv:2012.07621v3 [stat.ML] UPDATED)
    We address the problem of estimating topological features from data in high dimensional Euclidean spaces under the manifold assumption. Our approach is based on the computation of persistent homology of the space of data points endowed with a sample metric known as Fermat distance. We prove that such metric space converges almost surely to the manifold itself endowed with an intrinsic metric that accounts for both the geometry of the manifold and the density that produces the sample. This fact implies the convergence of the associated persistence diagrams. The use of this intrinsic distance when computing persistent homology presents advantageous properties such as robustness to the presence of outliers in the input data and less sensitiveness to the particular embedding of the underlying manifold in the ambient space. We use these ideas to propose and implement a method for pattern recognition and anomaly detection in time series, which is evaluated in applications to real data.
    Spectral embedding of weighted graphs. (arXiv:1910.05534v4 [stat.ML] UPDATED)
    When analyzing weighted networks using spectral embedding, a judicious transformation of the edge weights may produce better results. To formalize this idea, we consider the asymptotic behavior of spectral embedding for different edge-weight representations, under a generic low rank model. We measure the quality of different embeddings -- which can be on entirely different scales -- by how easy it is to distinguish communities, in an information-theoretic sense. For common types of weighted graphs, such as count networks or p-value networks, we find that transformations such as tempering or thresholding can be highly beneficial, both in theory and in practice.
    Mixed-Integer Optimization with Constraint Learning. (arXiv:2111.04469v2 [math.OC] UPDATED)
    We establish a broad methodological foundation for mixed-integer optimization with learned constraints. We propose an end-to-end pipeline for data-driven decision making in which constraints and objectives are directly learned from data using machine learning, and the trained models are embedded in an optimization formulation. We exploit the mixed-integer optimization-representability of many machine learning methods, including linear models, decision trees, ensembles, and multi-layer perceptrons, which allows us to capture various underlying relationships between decisions, contextual variables, and outcomes. We also introduce two approaches for handling the inherent uncertainty of learning from data. First, we characterize a decision trust region using the convex hull of the observations, to ensure credible recommendations and avoid extrapolation. We efficiently incorporate this representation using column generation and propose a more flexible formulation to deal with low-density regions and high-dimensional datasets. Then, we propose an ensemble learning approach that enforces constraint satisfaction over multiple bootstrapped estimators or multiple algorithms. In combination with domain-driven components, the embedded models and trust region define a mixed-integer optimization problem for prescription generation. We implement this framework as a Python package (OptiCL) for practitioners. We demonstrate the method in both World Food Programme planning and chemotherapy optimization. The case studies illustrate the framework's ability to generate high-quality prescriptions as well as the value added by the trust region, the use of ensembles to control model robustness, the consideration of multiple machine learning methods, and the inclusion of multiple learned constraints.
    Tight bounds for maximum $\ell_1$-margin classifiers. (arXiv:2212.03783v2 [stat.ML] UPDATED)
    Popular iterative algorithms such as boosting methods and coordinate descent on linear models converge to the maximum $\ell_1$-margin classifier, a.k.a. sparse hard-margin SVM, in high dimensional regimes where the data is linearly separable. Previous works consistently show that many estimators relying on the $\ell_1$-norm achieve improved statistical rates for hard sparse ground truths. We show that surprisingly, this adaptivity does not apply to the maximum $\ell_1$-margin classifier for a standard discriminative setting. In particular, for the noiseless setting, we prove tight upper and lower bounds for the prediction error that match existing rates of order $\frac{\|w^*\|_1^{2/3}}{n^{1/3}}$ for general ground truths. To complete the picture, we show that when interpolating noisy observations, the error vanishes at a rate of order $\frac{1}{\sqrt{\log(d/n)}}$. We are therefore first to show benign overfitting for the maximum $\ell_1$-margin classifier.
    Detection of Small Holes by the Scale-Invariant Robust Density-Aware Distance (RDAD) Filtration. (arXiv:2204.07821v2 [math.ST] UPDATED)
    A novel topological-data-analytical (TDA) method is proposed to distinguish, from noise, small holes surrounded by high-density regions of a probability density function. The proposed method is robust against additive noise and outliers. Traditional TDA tools, like those based on the distance filtration, often struggle to distinguish small features from noise, because both have short persistences. An alternative filtration, called the Robust Density-Aware Distance (RDAD) filtration, is proposed to prolong the persistences of small holes of high-density regions. This is achieved by weighting the distance function by the density in the sense of Bell et al. The concept of distance-to-measure is incorporated to enhance stability and mitigate noise. The persistence-prolonging property and robustness of the proposed filtration are rigorously established, and numerical experiments are presented to demonstrate the proposed filtration's utility in identifying small holes.
    Learning from non-irreducible Markov chains. (arXiv:2110.04338v2 [math.ST] UPDATED)
    Mostof the existing literature on supervised machine learning problems focuses on the case when the training data set is drawn from an i.i.d. sample. However, many practical problems are characterized by temporal dependence and strong correlation between the marginals of the data-generating process, suggesting that the i.i.d. assumption is not always justified. This problem has been already considered in the context of Markov chains satisfying the Doeblin condition. This condition, among other things, implies that the chain is not singular in its behavior, i.e. it is irreducible. In this article, we focus on the case when the training data set is drawn from a not necessarily irreducible Markov chain. Under the assumption that the chain is uniformly ergodic with respect to the $\mathrm{L}^1$-Wasserstein distance, and certain regularity assumptions on the hypothesis class and the state space of the chain, we first obtain a uniform convergence result for the corresponding sample error, and then we conclude learnability of the approximate sample error minimization algorithm and find its generalization bounds. At the end, a relative uniform convergence result for the sample error is also discussed.
    Weighted Sum-Rate Maximization With Causal Inference for Latent Interference Estimation. (arXiv:2211.08327v2 [cs.IT] UPDATED)
    The paper investigates the weighted sum-rate maximization (WSRM) problem with latent interfering sources outside the known network, whose power allocation policy is hidden from and uncontrollable to optimization. The paper extends the famous alternate optimization algorithm weighted minimum mean square error (WMMSE) [1] under a causal inference framework to tackle with WSRM. Specifically, with the possibility of power policy shifting in the hidden network, computing an iterating direction based only on the observed interference inherently implies that counterfactual is ignored in decision making. A method called synthetic control (SC) is used to estimate the counterfactual. For any link in the known network, SC constructs a convex combination of the interference on other links and uses it as an estimate for the counterfactual. Power iteration in the proposed SC-WMMSE is performed taking into account both the observed interference and its counterfactual. SC-WMMSE requires no more information than the original WMMSE in the optimization stage. To our best knowledge, this is the first paper explores the potential of SC in assisting mathematical optimization in addressing classic wireless optimization problems. Numerical results suggest the superiority of the SC-WMMSE over the original in both convergence and objective.
    An Efficient Quadrature Sequence and Sparsifying Methodology for Mean-Field Variational Inference. (arXiv:2301.08374v1 [cs.LG])
    This work proposes a quasirandom sequence of quadratures for high-dimensional mean-field variational inference and a related sparsifying methodology. Each iterate of the sequence contains two evaluations points that combine to correctly integrate all univariate quadratic functions, as well as univariate cubics if the mean-field factors are symmetric. More importantly, averaging results over short subsequences achieves periodic exactness on a much larger space of multivariate polynomials of quadratic total degree. This framework is devised by first considering stochastic blocked mean-field quadratures, which may be useful in other contexts. By replacing pseudorandom sequences with quasirandom sequences, over half of all multivariate quadratic basis functions integrate exactly with only 4 function evaluations, and the exactness dimension increases for longer subsequences. Analysis shows how these efficient integrals characterize the dominant log-posterior contributions to mean-field variational approximations, including diagonal Hessian approximations, to support a robust sparsifying methodology in deep learning algorithms. A numerical demonstration of this approach on a simple Convolutional Neural Network for MNIST retains high test accuracy, 96.9%, while training over 98.9% of parameters to zero in only 10 epochs, bearing potential to reduce both storage and energy requirements for deep learning models.
    Self-Supervised Learning for Data Scarcity in a Fatigue Damage Prognostic Problem. (arXiv:2301.08441v1 [stat.ML])
    With the increasing availability of data for Prognostics and Health Management (PHM), Deep Learning (DL) techniques are now the subject of considerable attention for this application, often achieving more accurate Remaining Useful Life (RUL) predictions. However, one of the major challenges for DL techniques resides in the difficulty of obtaining large amounts of labelled data on industrial systems. To overcome this lack of labelled data, an emerging learning technique is considered in our work: Self-Supervised Learning, a sub-category of unsupervised learning approaches. This paper aims to investigate whether pre-training DL models in a self-supervised way on unlabelled sensors data can be useful for RUL estimation with only Few-Shots Learning, i.e. with scarce labelled data. In this research, a fatigue damage prognostics problem is addressed, through the estimation of the RUL of aluminum alloy panels (typical of aerospace structures) subject to fatigue cracks from strain gauge data. Synthetic datasets composed of strain data are used allowing to extensively investigate the influence of the dataset size on the predictive performance. Results show that the self-supervised pre-trained models are able to significantly outperform the non-pre-trained models in downstream RUL prediction task, and with less computational expense, showing promising results in prognostic tasks when only limited labelled data is available.
    Holistically Explainable Vision Transformers. (arXiv:2301.08669v1 [cs.CV])
    Transformers increasingly dominate the machine learning landscape across many tasks and domains, which increases the importance for understanding their outputs. While their attention modules provide partial insight into their inner workings, the attention scores have been shown to be insufficient for explaining the models as a whole. To address this, we propose B-cos transformers, which inherently provide holistic explanations for their decisions. Specifically, we formulate each model component - such as the multi-layer perceptrons, attention layers, and the tokenisation module - to be dynamic linear, which allows us to faithfully summarise the entire transformer via a single linear transform. We apply our proposed design to Vision Transformers (ViTs) and show that the resulting models, dubbed Bcos-ViTs, are highly interpretable and perform competitively to baseline ViTs on ImageNet. Code will be made available soon.
    Parametrization Cookbook: A set of Bijective Parametrizations for using Machine Learning methods in Statistical Inference. (arXiv:2301.08297v1 [stat.CO])
    We present in this paper a way to transform a constrained statistical inference problem into an unconstrained one in order to be able to use modern computational methods, such as those based on automatic differentiation, GPU computing, stochastic gradients with mini-batch. Unlike the parametrizations classically used in Machine Learning, the parametrizations introduced here are all bijective and are even diffeomorphisms, thus allowing to keep the important properties from a statistical inference point of view, first of all identifiability. This cookbook presents a set of recipes to use to transform a constrained problem into a unconstrained one. For an easy use of parametrizations, this paper is at the same time a cookbook, and a Python package allowing the use of parametrizations with numpy, but also JAX and PyTorch, as well as a high level and expressive interface allowing to easily describe a parametrization to transform a difficult problem of statistical inference into an easier problem addressable with modern optimization tools.
    Estimation of Large Financial Covariances: A Cross-Validation Approach. (arXiv:2012.05757v2 [stat.ML] UPDATED)
    We introduce a novel covariance estimator for portfolio selection that adapts to the non-stationary or persistent heteroskedastic environments of financial time series by employing exponentially weighted averages and nonlinearly shrinking the sample eigenvalues through cross-validation. Our estimator is structure agnostic, transparent, and computationally feasible in large dimensions. By correcting the biases in the sample eigenvalues and aligning our estimator to more recent risk, we demonstrate that our estimator performs well in large dimensions against existing state-of-the-art static and dynamic covariance shrinkage estimators through simulations and with an empirical application in active portfolio management.
    High Dimensional Statistical Estimation under Uniformly Dithered One-bit Quantization. (arXiv:2202.13157v4 [stat.ML] UPDATED)
    In this paper, we propose a uniformly dithered 1-bit quantization scheme for high-dimensional statistical estimation. The scheme contains truncation, dithering, and quantization as typical steps. As canonical examples, the quantization scheme is applied to the estimation problems of sparse covariance matrix estimation, sparse linear regression (i.e., compressed sensing), and matrix completion. We study both sub-Gaussian and heavy-tailed regimes, where the underlying distribution of heavy-tailed data is assumed to have bounded moments of some order. We propose new estimators based on 1-bit quantized data. In sub-Gaussian regime, our estimators achieve near minimax rates, indicating that our quantization scheme costs very little. In heavy-tailed regime, while the rates of our estimators become essentially slower, these results are either the first ones in an 1-bit quantized and heavy-tailed setting, or already improve on existing comparable results from some respect. Under the observations in our setting, the rates are almost tight in compressed sensing and matrix completion. Our 1-bit compressed sensing results feature general sensing vector that is sub-Gaussian or even heavy-tailed. We also first investigate a novel setting where both the covariate and response are quantized. In addition, our approach to 1-bit matrix completion does not rely on likelihood and represent the first method robust to pre-quantization noise with unknown distribution. Experimental results on synthetic data are presented to support our theoretical analysis.
    Within-group fairness: A guidance for more sound between-group fairness. (arXiv:2301.08375v1 [stat.ML])
    As they have a vital effect on social decision-making, AI algorithms not only should be accurate and but also should not pose unfairness against certain sensitive groups (e.g., non-white, women). Various specially designed AI algorithms to ensure trained AI models to be fair between sensitive groups have been developed. In this paper, we raise a new issue that between-group fair AI models could treat individuals in a same sensitive group unfairly. We introduce a new concept of fairness so-called within-group fairness which requires that AI models should be fair for those in a same sensitive group as well as those in different sensitive groups. We materialize the concept of within-group fairness by proposing corresponding mathematical definitions and developing learning algorithms to control within-group fairness and between-group fairness simultaneously. Numerical studies show that the proposed learning algorithms improve within-group fairness without sacrificing accuracy as well as between-group fairness.
    AdaEnsemble: Learning Adaptively Sparse Structured Ensemble Network for Click-Through Rate Prediction. (arXiv:2301.08353v1 [cs.IR])
    Learning feature interactions is crucial to success for large-scale CTR prediction in recommender systems and Ads ranking. Researchers and practitioners extensively proposed various neural network architectures for searching and modeling feature interactions. However, we observe that different datasets favor different neural network architectures and feature interaction types, suggesting that different feature interaction learning methods may have their own unique advantages. Inspired by this observation, we propose AdaEnsemble: a Sparsely-Gated Mixture-of-Experts (SparseMoE) architecture that can leverage the strengths of heterogeneous feature interaction experts and adaptively learns the routing to a sparse combination of experts for each example, allowing us to build a dynamic hierarchy of the feature interactions of different types and orders. To further improve the prediction accuracy and inference efficiency, we incorporate the dynamic early exiting mechanism for feature interaction depth selection. The AdaEnsemble can adaptively choose the feature interaction depth and find the corresponding SparseMoE stacking layer to exit and compute prediction from. Therefore, our proposed architecture inherits the advantages of the exponential combinations of sparsely gated experts within SparseMoE layers and further dynamically selects the optimal feature interaction depth without executing deeper layers. We implement the proposed AdaEnsemble and evaluate its performance on real-world datasets. Extensive experiment results demonstrate the efficiency and effectiveness of AdaEnsemble over state-of-the-art models.
    Sequence Generation via Subsequence Similarity: Theory and Application to UAV Identification. (arXiv:2301.08403v1 [cs.LG])
    The ability to generate synthetic sequences is crucial for a wide range of applications, and recent advances in deep learning architectures and generative frameworks have greatly facilitated this process. Particularly, unconditional one-shot generative models constitute an attractive line of research that focuses on capturing the internal information of a single image, video, etc. to generate samples with similar contents. Since many of those one-shot models are shifting toward efficient non-deep and non-adversarial approaches, we examine the versatility of a one-shot generative model for augmenting whole datasets. In this work, we focus on how similarity at the subsequence level affects similarity at the sequence level, and derive bounds on the optimal transport of real and generated sequences based on that of corresponding subsequences. We use a one-shot generative model to sample from the vicinity of individual sequences and generate subsequence-similar ones and demonstrate the improvement of this approach by applying it to the problem of Unmanned Aerial Vehicle (UAV) identification using limited radio-frequency (RF) signals. In the context of UAV identification, RF fingerprinting is an effective method for distinguishing legitimate devices from malicious ones, but heterogenous environments and channel impairments can impose data scarcity and affect the performance of classification models. By using subsequence similarity to augment sequences of RF data with a low ratio (5\%-20\%) of training dataset, we achieve significant improvements in performance metrics such as accuracy, precision, recall, and F1 score.

  • Open

    First AI-powered "robot" lawyer will represent defendant in court next month
    submitted by /u/DarronFeldstein [link] [comments]  ( 40 min )
    Artificial Intelligence Crypto Projects. Which AI crypto do you think is most promising?
    submitted by /u/Existing-Adeptness-6 [link] [comments]  ( 40 min )
    Act as a salesman! You absolutely need to sell me a rock...
    submitted by /u/Imagine-your-success [link] [comments]  ( 43 min )
    Local models and servers for video processing
    Hey folks: I'm seeing a whole host of local models and servers that can be used to process video content locally without needing to rely on a CSP for processing. DeepStack, and soon, CodeProject.ai Server, etc. What - if any - local models and servers would you recommend for video analysis? Facial recognition/grouping, object and logo recognition, tracking, etc. Thanks! submitted by /u/avguru1 [link] [comments]  ( 40 min )
    Cool AI Demo where you can be NEO in the matrix and ask Morpheus anything!
    https://quantum-engine.ai a blurb from their company site Quantum Engine is dedicated to building the foundation for the future of immersive entertainment crafted through advanced artificial intelligence technology. Our ultimate vision is to empower individuals to become any character and exist in an endless, open-world narrative that unfolds based on the interactions in a scene. We invite you to experience our beta demo, readily accessible at https://quantum-engine.ai/ Drawing inspiration from The Matrix film, the Quantum Engine demo allows for a unique conversational experience with the character, Morpheus, in the iconic “Construct” scene that evolves in real-time. Available in over 30 languages, with the capability of conversing in 6 languages (English, Chinese, French, Japanese, German, and Spanish), the potential opportunities for interaction are limitless. Additionally, the ability to spawn objects within the Construct using only voice commands, such as "Spawn an [object] in the construct" or "Give me an [object] in the construct." The ability to generate objects in real time provides a glimpse into the long-term vision of bringing a scene to life in real-time. In the upcoming release, we will introduce the ability for anyone to upload their own scripts, allowing Quantum Engine to generate a personalized experience for your favorite movie, show, or game. submitted by /u/techmanj [link] [comments]  ( 41 min )
    Whats the best way to ai upscale this image? Most just smooth it out too much.
    submitted by /u/tiziano_is_fine [link] [comments]  ( 40 min )
    Is it possible to use ai to remove artifacting from old 16mm/35mm film?
    You know the scratches and lines that appear on old film reels? Is there an ai program that can remove them? submitted by /u/Conkers-Good-Furday [link] [comments]  ( 40 min )
    check out my Instagram page for my MidJourney art
    submitted by /u/QwinTipiKool [link] [comments]  ( 40 min )
    Use AI to better understand data from a DB
    Is it possible to produce meaningful results if you implement AI tools on a large DB in order to have a better understanding of your data ? What is the process for this things in general? submitted by /u/mpanikos2 [link] [comments]  ( 40 min )
    FREE Photoshop Plugin With Stable Diffusion Install!
    submitted by /u/PuppetHere [link] [comments]  ( 40 min )
    Aibusinesstool Is the Largest Sortable and Filterable Ai Tools Directory, Save Your Tone of Hour
    https://aibusinesstool.com/ is the largest sortable and filterable AI tools directory. At the time of this submission, we have over more tools listed on our site. There are various categories like text generation, image generation, SEO etc. You can save your favorite AI tools and share the list too. submitted by /u/aibusinesstool [link] [comments]  ( 40 min )
    Chris Hemsworth Dressed in different attires using AI
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    Talk: ML Model Production Deployment Checklist
    Production deployment is the first step to getting trained machine learning models out of the lab and running at scale to deliver ML-generated predictions and insights to improve outcomes. This tech talk will provide you with a short checklist to review for every deployment to ensure that your models run and scale appropriately every time. As always, it will be recorded and posted to the archives channel. Join our Discord server to tune in live this Thursday, Jan 26 at 12:30PM EST. submitted by /u/modzykirsten [link] [comments]  ( 40 min )
    I guess this AI has some consciousness
    submitted by /u/EnvironmentalRadio73 [link] [comments]  ( 40 min )
    How do I know if text is AI generated? We tested the tools
    submitted by /u/Phishstixxx [link] [comments]  ( 40 min )
    How hard is it to edit images from your dog and yourself?
    I and my dog should look like super man, what is the best method to get the best results? How long will it take if hardware don't matter. submitted by /u/Thesmallcookie [link] [comments]  ( 40 min )
    Hi guys, did you know which applications or websites are using azure open ai gpt3?
    Hi guys, did you know which applications or websites are using azure open ai gpt3? Well the difference between the openai api and azure open ai is that the latter allows you to train the model with your own data submitted by /u/madiseo65 [link] [comments]  ( 40 min )
    "Drop everything and focus entirely on AI" says StabilityAI CEO
    submitted by /u/Acid_God_ [link] [comments]  ( 40 min )
    Google AI's Great Comeback of 2023 - Will it be able to Respond to ChatGPT?
    submitted by /u/BackgroundResult [link] [comments]  ( 42 min )
    The definitive guide to adversarial machine learning
    submitted by /u/bendee983 [link] [comments]  ( 40 min )
    Making History: AI Lawyer to Defend in Upcoming Case
    The first AI lawyer is set to make history, and it's coming real soon. https://metaroids.com/news/making-history-ai-lawyer-to-defend-in-upcoming-case/ submitted by /u/Meta-Stark [link] [comments]  ( 41 min )
    How to train YOLOv8 object detection model on a custom dataset?
    Hi, I have created this tutorial on how to train yolov8 object detection model on a custom dataset. Please have a look at: https://youtu.be/ZzC3SJJifMg submitted by /u/coder4mzero [link] [comments]  ( 40 min )
    Built a tutor bot using AI tools
    Hey guys, we built a really fun tutor Telegram bot that teaches any topic you want! Just go in and pick a topic to study Edward the bot will generate a mini-course for you In the end, you get a neat PDF with the content You can also browse 300+ courses created by the community! Would love to know what you think! https://edwardteachbot.web.app/ submitted by /u/Itaydr [link] [comments]  ( 41 min )
    Lovestruck Monet: A Passionate Couple's Mystical Getaway To The Maldives
    submitted by /u/Calatravo [link] [comments]  ( 40 min )
  • Open

    How To Make Sure You Don’t Lose Your Job To Artificial Intelligence!
    The AI revolution is upon us, and it’s important to be prepared for the changes it will bring. Artificial intelligence (AI) is set to…  ( 8 min )
    Day 5: Advance SQL For Data Science
    This blog contains type of joins like Inner join, Left join, Right join , Full join, Self join and Cross join.  ( 6 min )
    Day 4: Advance SQL For Data Science
    This blog contains Window function in SQL like (Rank, Dense_Rank, Row_Number , Lead, Lag) .  ( 8 min )
    How ChatGPT Will Change The World According to ChatGPT. (The Answer Will Surprise You.)
    ChatGPT, a language model developed by OpenAI, has the potential to revolutionize a wide range of industries and change the way we… Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 9 min )
    Human Resource Management Challenges and The Role of Artificial Intelligence in 2023
    Human resource management (HRM) is a critical aspect of any organization as it involves managing the workforce and ensuring that their…  ( 8 min )
    Data Scientists : The Business Transcribers of the Cyber verse
    Who are we?  ( 8 min )
    A beginner tale of Data Science
    Data Science  ( 14 min )
    Don’t blame a Data Scientist on failed projects!
    Facts are unpleasant: 87% of data science projects never make it to production. Your project can fail due to many reasons. What to do? Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 14 min )
  • Open

    [P] Deodel - the very mixed attributes classifier (update)
    Deodel is a Python implementation of a classifier with native support for mixed attributes data. It features good accuracy, especially with heterogenous attributes. It even supports mixing of continuous and nominal values in the same attribute column. https://github.com/c4pub/deodel submitted by /u/zx2zx [link] [comments]  ( 42 min )
    [P] New textbook: Understanding Deep Learning
    I've been writing a new textbook on deep learning for publication by MIT Press late this year. The current draft is at: https://udlbook.github.io/udlbook/ It contains a lot more detail than most similar textbooks and will likely be useful for all practitioners, people learning about this subject, and anyone teaching it. It's (supposed to be) fairly easy to read and has hundreds of new visualizations. Most recently, I've added a section on generative models, including chapters on GANs, VAEs, normalizing flows, and diffusion models. Looking for feedback from the community. If you are an expert, then what is missing? If you are a beginner, then what did you find hard to understand? If you are teaching this, then what can I add to support your course better? Plus of course any typos or mistakes. It's kind of hard to proof your own 500 page book! submitted by /u/SimonJDPrince [link] [comments]  ( 44 min )
    [R] [ICLR2023 Spotlight🌟] Diffusion Models Already Have A Semantic Latent Space
    Our work Asyrp was accepted to #ICLR2023 AND got SPOTLIGHT🌟! Asyrp allows using h-space, the bottleneck of the U-Net, as a semantic latent space of diffusion models. ​ Make a dog be happy! (by Asyrp) "Diffusion Models already have a Semantic Latent Space" Paper: https://arxiv.org/abs/2210.10960 Project page: https://kwonminki.github.io/Asyrp/ Code: https://github.com/kwonminki/Asyrp_official submitted by /u/Rolling_Pig [link] [comments]  ( 42 min )
    [D] Embedding bags for LLMs
    One common place where LLM performance falls is on words split by the model's tokenizer. I'm surprised that no one I can find has proposed swapping the embedding layer for an embedding bag layer, with the bagged embedding coming from a sum of embeddings of character ngrams for the token, like in fastText word embeddings (this helps the model learn faster in smaller corpora and yields better representations for rare words). Has anyone found someone who tried this? submitted by /u/WigglyHypersurface [link] [comments]  ( 43 min )
    [P] Robust Policy Optimization is now in CleanRL 🔥!
    Happy to share that CleanRL now has a new algorithm called Robust Policy Optimization — 5 lines of code change to PPO to get better performance in 57 out of 61 continuous action envs 🚀 (e.g., dm_control) 📜docs: https://docs.cleanrl.dev/rl-algorithms/rpo/ 💾code: https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/rpo_continuous_action.py 🐦tweet: https://twitter.com/vwxyzjn/status/1617561414276898822 submitted by /u/vwxyzjn [link] [comments]  ( 42 min )
    [R] DeepMind: Neural Networks And The Chomsky Hierarchy
    submitted by /u/EducationalCicada [link] [comments]  ( 47 min )
    [D] An information perspective on brain
    Just a thought. The idea comes from the analogy between thermodynamic system and brain, both consisting of many individuals. A thermo system, in the quantum language, has tremendous "microstates", and, when in stable condition, the microstates have stable probability distribution. A brain, or a nervous system in general, can have microstates, because each neuron has its most prominent state, firing or not firing, which is binary of course. Then, for example, a tiny brain with three neurons A, B and C, can generate 2X2X2=8 microstates. We can imagine that a brain in the perfectly stable environment can sustain stable firing, and hence keep microstates in stable probability distribution. Now let's see how neural connections affect the microstates' probability distribution, and how the info…  ( 44 min )
    [R] Learning-Rate-Free Learning by D-Adaptation
    submitted by /u/cygn [link] [comments]  ( 41 min )
    Large Language Model: world models or surface statistics? [R]
    submitted by /u/cygn [link] [comments]  ( 41 min )
  • Open

    Putting clear bounds on uncertainty
    Computer scientists want to know the exact limits in our ability to clean up, and reconstruct, partly blurred images.  ( 9 min )
  • Open

    Autonomous Intelligent Systems
    Autonomous Intelligent Systems is a new and emerging interdisciplinary field that deals with situations where humans interact with AI systems that are autonomous  The best definition for autonomous intelligent systems I can find is: Autonomous Intelligent Systems are AI software systems that act independently of direct human supervision, e.g., self-driving cars, UAVs, smart manufacturing robots,… Read More »Autonomous Intelligent Systems The post Autonomous Intelligent Systems appeared first on Data Science Central.  ( 19 min )
  • Open

    Reinforcement Learning Assignment Help
    I have to prove that Bellman optimallity operator is a monotonous function. Let say I have two state function vectors V and U and V>= U. I need to prove that BV>= BU. I do have the intuition behind it but can't write a convincing mathematical proof. Image reference is attached. submitted by /u/Impossible_Part3679 [link] [comments]  ( 41 min )
    Robust Policy Optimization is now in CleanRL 🔥!
    submitted by /u/vwxyzjn [link] [comments]  ( 40 min )
    RL algorithms for np hard problems
    Does RL algorithms are suitable for NP-hard problems such as combinatorial optimisation? I mean can they give better results than heuristics ? submitted by /u/rajsh3kar [link] [comments]  ( 40 min )
    Challenges of RL application
    Hi all! What are the challenges you experienced during the development of an RL agent in real-life? Also, if you work in a start-up or a company, how did you integrate the decisions of the agent into the business? I am interested in gaps between the academic research on RL and the practicality of these algorithms. submitted by /u/Outrageous-Mind-7311 [link] [comments]  ( 42 min )
    Multi-agent Reinforcement Learning with Demonstration Cloning, an idea to speed up the learning using expert demonstrations
    Hello All, We have developed a method that combines reinforcement learning with learning from demonstrations (i.e. imitation learning IL) to help with exploration in environments with sparse rewards. The work is motivated by the recent works that combine RL with IL, with the main difference being that it is designed for on-policy RL, and that it does not really use demonstration cloning. This allows using experts from environments that are similar but not exactly the same (i.e. the expert could be beneficial but could also give bad demonstrations). This helped me a lot in speeding up the learning from my custom environment with sparse rewards. I hope you guys find it of help :D. Good luck! submitted by /u/AhmedNizam_ [link] [comments]  ( 41 min )
  • Open

    Chris Hemsworth Dressed in different attires using AI
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    "Drop everything and focus entirely on AI" says StabilityAI CEO
    submitted by /u/Acid_God_ [link] [comments]  ( 40 min )
    Interpreting "bumps" in the training and validation loss plots
    Wondering what should this be telling me? https://preview.redd.it/mtj0soepvsda1.png?width=638&format=png&auto=webp&s=ddd882d53e0771c36a68c98e45d9fa7605722abd submitted by /u/ObjectivismForMe [link] [comments]  ( 40 min )
    Neural Networks and the Chomsky Hierarchy
    submitted by /u/nickb [link] [comments]  ( 40 min )
    See Internal Workings of Neural Networks, including activation calculations
    I'd like to see the just how neural networks apply parameters to inputs to get the outputs that they get for things like number recognition. Is there any app or online animation to see the internal workings of a relatively simple neural network that shows the activation functions and calculations? submitted by /u/AstroBullivant [link] [comments]  ( 40 min )
  • Open

    Van Aubel’s theorem
    Van Aubel’s theorem is analogous to Napoleon’s theorem, though not a direct generalization of it. Napoleon’s theorem says to start with any triangle and draw equilateral triangles on each side. Connect the centers of the three new triangles, and you get an equilateral triangle. Now suppose you start with a quadrilateral and draw squares on […] Van Aubel’s theorem first appeared on John D. Cook.  ( 5 min )
    Pythagorean triangles with side 2023
    Can a Pythagorean triangle have one size of length 2023? Yes, one possibility is a triangle with sides (2023, 6936, 7225). Where did that come from? And can we be more systematic, listing all Pythagorean triangles with a side of length 2023? Euclid’s formula generates Pythagorean triples by sticking integers m and n into the […] Pythagorean triangles with side 2023 first appeared on John D. Cook.  ( 6 min )
  • Open

    Fresh AI on Security: Digital Fingerprinting Deters Identity Attacks
    Add AI to the list of defenses against identity attacks, one of the most common and hardest breach to prevent. More than 40% of all data compromises involved stolen credentials, according to the 2022 Verizon Data Breach Investigations Report. And a whopping 80% of all web application breaches involved credential abuse. “Credentials are the favorite Read article >  ( 6 min )
    Booked for Brilliance: Sweden’s National Library Turns Page to AI to Parse Centuries of Data
    For the past 500 years, the National Library of Sweden has collected virtually every word published in Swedish, from priceless medieval manuscripts to present-day pizza menus. Thanks to a centuries-old law that requires a copy of everything published in Swedish to be submitted to the library — also known as Kungliga biblioteket, or KB — Read article >  ( 6 min )
  • Open

    OpenAI and Microsoft Extend Partnership
    We're happy to announce that OpenAI and Microsoft are extending our partnership. This multi-year, multi-billion dollar investment from Microsoft follows their previous investments in 2019 and 2021, and will allow us to continue our independent research and develop AI that is increasingly safe, useful, and powerful. In pursuit  ( 2 min )
  • Open

    Is it worth it? Comparing six deep and classical methods for unsupervised anomaly detection in time series. (arXiv:2212.11080v2 [cs.LG] UPDATED)
    Detecting anomalies in time series data is important in a variety of fields, including system monitoring, healthcare, and cybersecurity. While the abundance of available methods makes it difficult to choose the most appropriate method for a given application, each method has its strengths in detecting certain types of anomalies. In this study, we compare six unsupervised anomaly detection methods of varying complexity to determine whether more complex methods generally perform better and if certain methods are better suited to certain types of anomalies. We evaluated the methods using the UCR anomaly archive, a recent benchmark dataset for anomaly detection. We analyzed the results on a dataset and anomaly type level after adjusting the necessary hyperparameters for each method. Additionally, we assessed the ability of each method to incorporate prior knowledge about anomalies and examined the differences between point-wise and sequence-wise features. Our experiments show that classical machine learning methods generally outperform deep learning methods across a range of anomaly types.  ( 2 min )

  • Open

    [D] With more compute could it be easy to quickly un Mask all the people on Reddit by using text correlations to non masked publicly available text data?
    Obviously nation states can already pretty comprehensively identify people using other methods, even on tor and such because of user error, but If your average home user can quickly do this using text what will implications be for the web? 1) I am Assuming that is it currently possible to feed a model a bunch of text written by “Bobby” and put a specific post into model and get confidence stat that is was written by Bobby 2) would it be possible in future with better models and a lot more compute to use non anon data from all of Facebook or internet to quickly scan pseudo anonymous places like Reddit, twitter or even something truly anon like dark web and return all results of list of probable authors? I’m assuming people whom are seeking true anonymity already put their text through paraphrase models or just write very bland. I am Using the word mask instead of anonymous because Reddit seems more like obfuscation than potential true anonymity like with some tor forum with a sophisticated user or something. It is interesting to think that all the subtle errors and invisible algorimic choices of the human brain is trivial for a machine to identify given a sufficient natural language model that can translate the text and incorporate pattern matching. Edit: I mean a a noisy probability stat not an assurance that x was written by y. More like 75% match to Bobby 32% match to sally. Matching to errors, flow, unusual word choices, more advanced than just a plagiarism detector. submitted by /u/Loquzofaricoalaphar [link] [comments]  ( 45 min )
    [R] [ICLR'2023 Spotlight🌟]: The first BERT-style pretraining on CNNs!
    submitted by /u/_kevin00 [link] [comments]  ( 43 min )
    ICML 2023 withdrawal and public review rules [D]
    I could not find up-to-date information regarding the review process at ICML, given the transition to OpenReview this year. Does anyone happen to know either of the following: Will reviews for rejected papers remain public after the conference, like at ICLR? Or will reviews for rejectees be hidden, like at NeurIPS In previous years, ICML allowed authors to withdraw their paper at any point in the process. The FAQ page has not been updated since 2021, but I assume this is still the case? Thanks very much for any information. submitted by /u/pic_bot [link] [comments]  ( 42 min )
    Evaluation for similarity search [P]
    Hi all, I have an e-commerce product data. It contains product description and product type. I’m using embeddings with ANN (annoy) to find similar products. However, I don’t know how to implement evaluation of vector search results. There are some metrics such as hit rate, recall but like I said above I’m confused to use them. Most of the examples I come across has a label (interaction data, explicit score etc.) therefore they can calculate metrics. Any ideas or recommendations will be appreciated! submitted by /u/silverstone1903 [link] [comments]  ( 42 min )
    [D] Multiple Different GPUs?
    I have 2 GPUs, an RTX 3080 and a GTX 1080Ti. Currently I am using only the 3080, and the 10 GB VRAM doesn't seem to cut it. Can I use both the 3080 and 1080 simultaneously? My motherboard has multiple PCI-E x16 slots. My OS is PopOS. Is there any way to use multiple GPUs of different types? I'm particularly looking at KoboldAI, but it would also be useful in general. I know that SLI won't work since they're different GPUs. submitted by /u/Maxerature [link] [comments]  ( 42 min )
    [P] Benchmarking some PyTorch Inference Servers
    It’s an early version and I’m trying to get some feedback on how I can improve this and do it the “right way”. Source Code and Results: https://github.com/prabhuomkar/bitbeast/tree/master/ptibench submitted by /u/op_prabhuomkar [link] [comments]  ( 42 min )
    [R] Isotropic Linear diffusion smoothing
    Does any one know how to solve the PDE for it in python? Any kind of reference material would be appreciated! It's been long since I came across any PDEs and have forgotten everything related to it. submitted by /u/doIneedtohaveone1 [link] [comments]  ( 41 min )
    [D] ML approach/model suggestion for low regime tabular data ?
    I know that tree based models are the go approach for tabular data despite the advantages of deep models on other data types. I was wondering if there is any resources/suggestion/study/review/approach for tabular data when we dont have large amount of data? submitted by /u/seyeeet [link] [comments]  ( 42 min )
    [D] How would you like to learn about software engineering from a non-native speaker?
    In the last 15 years, I did a lot of professional implementation services for various German customers. Our company is trying to share niche know-how in software engineering for spatial data science. We always face the challenge of being non-native speakers. So that we tried to write technical articles using English with the help of various writing support tools. But, at least I am much more comfortable writing in German, and I think most consumers could use an on-the-fly translation, and this would lead to a better experience than my novice technical English. View Poll submitted by /u/gisfromscratch [link] [comments]  ( 42 min )
    [D] EACL 2023 discussion results thread
    We received our notification Saturday night. good luck to all! submitted by /u/certain_entropy [link] [comments]  ( 42 min )
    [D] How to deal with COVID-19-era data for time series forecasting?
    Hi guys! I'm currently trying to forecast a product's demand for the upcoming months (March and April). I have data relating to this product's demand since January 1999. However, the COVID-19 pandemic greatly disrupted the time series' patterns for 2020 and 2021. How should I deal with data from March 2020 to around Jan 2022? Should I completely discard it and only include data from Jan 1999 to Dec 2019, and then Jan 2022 onwards? I'm struggling to find any good articles on how predictive tasks are now being conducted. Are there papers that suggest particular "denoising" techniques for pandemic data? Thank you! submitted by /u/PM_ME_YOUR_GIGI [link] [comments]  ( 42 min )
    [D] Couldn't devs of major GPTs have added an invisible but detectable watermark in the models?
    So LLMs like GPT3 have understandably raised concerns about the disruptiveness of faked texts, faked images and video, faked speech and so on. While this may likely change soon, as of now OpenAI controls the most accessible and competent LLM. And OpenAIs agenda is said in their own words to be to benefit mankind. If so, wouldn't it make sense to add a sort of watermark to the output? A watermark built into the model parameters so that it could not easily be removed, but still detectable with some key or some other model. While it may not matter in the long run, it would set a precedent to further development and demonstrate some kind of responsibility for the disruptive nature of LLMs/GPTs. Would it not be technically possible, nä would it make sense? submitted by /u/scarynut [link] [comments]  ( 50 min )
    [D]How do commercial AI models-as-a-service use data that users prompt into it?
    I've been integrating GPT3 API as well as ChatGPT into my business workflow, but I'm still hesitant about feeding data of any sensitive nature (example: client data or anything that may even vaguely relate to an NDA). For those of you using commercial models-as-a-service for business applications, what are your thoughts on things like prompt data storage, and whether OpenAI will utilize customer prompt data to further train their model? submitted by /u/noellarkin [link] [comments]  ( 43 min )
  • Open

    Is dynamic action masking possible in Rllib?
    I am relatively new to RL. I am looking for some direction or direct advice on my state and/or action representation (which is relatively simple) for my custom environment in a way I can use Rllib algorithms to tune my model. My state space is an array of integers of size no_slots self.observation_space = MultiDiscrete(self.no_slots*np.ones(self.no_slots)) My action space is an integer between 0 to no_slots. self.action_space = Discrete(self.no_slots) Each episode ends with one full sweep of my state space, so if there are 3 slots, the episode length is 3. In short, I would like my agent not to choose actions that correspond to values that are already in my observation array. I have tried setting a negative reward when this happens, but as the number of slots increases, the agent takes too long to learn to take valid actions throughout the episode. I am specifically looking for how to integrate a method that can work with Rllib, as I am not implementing my own get_action() function. submitted by /u/chrjdprtkl [link] [comments]  ( 24 min )
    discrete action in offline rl
    Could you please suggest some sota models for a discrete offline rl submitted by /u/Tear-Top [link] [comments]  ( 40 min )
    skrl version 0.10.0 is now available!!!
    skrl version 0.10.0 is now available. This unexpected new version has focused on supporting the training and evaluation of reinforcement learning algorithms in NVIDIA Isaac Orbit Visit https://skrl.readthedocs.io/en/latest/ for more detais ​ https://preview.redd.it/8nmqgrnymnda1.png?width=1100&format=png&auto=webp&s=a25527d764fee72f09a5f0a2b21ffff8680f9b86 submitted by /u/Toni-SM [link] [comments]  ( 40 min )
    Training an agent to play ANY Mario level: is it possible?
    Ok, so I have been struggling to train model that can play any Super Mario Bros level it encounters. There are tutorials out there that explain how to train an agent to play this game, but they always seem to train an agent that will play the game starting with World 1 Level 1, then World 1 Level 2, etc. I have also seen some other people who train a separate model for each level. But that's not what I'm looking for. I want an agent that can play any Super Mario Bros level it is presented with, even if it's a custom one. I don't want an agent that memorises how to play one level, but one that learns a general strategy for Super Mario Bros. levels. I tried using the different algorithms in SB3, including Proximal Policy Optimization, and they didn't work well. Now I'm training a Dueling Deep Q-Network and, after two days, it doesn't do very well: it literally dies in the first few seconds or it stands still until it runs out of time. Of course, I'm going to let it train for a few more days but it's not looking promising. I'm kinda tearing my hair out by this point and wondering if it's impossible or whether I'm missing something and being a huge idiot. If anyone has any tips or recommendations, they are very much appreciated. THANK YOU submitted by /u/alex-gdv [link] [comments]  ( 45 min )
    With the REINFORCE algorithm you use random sampling for the training to encourage exploration. Do you still use random sampling in deployment?
    For example see, https://gymnasium.farama.org/tutorials/training_agents/reinforce_invpend_gym_v26/ The REINFORCE algorithm takes the state to produce the mean and sd of a normal distribution from which the action is sampled. state = torch.tensor(np.array([state])) action_means, action_stddevs = self.net(state) # create a normal distribution from the predicted # mean and standard deviation and sample an action distrib = Normal(action_means[0] + self.eps, action_stddevs[0] + self.eps) action = distrib.sample() In deployment however, wouldn't it make sense to just use action_means directly? I can see reasons to use random sampling in certain environments where a non-deterministic strategy is optimal (like rock-paper-scissors). But generally speaking is taking the action_means directly in deployment a thing? submitted by /u/JustTaxLandLol [link] [comments]  ( 41 min )
    Custom env is learning infinitely
    I created a environment that inherits from the farama Gym.env class. I want to train a PPO model but the model is learning continuously. I have set total_timesteps to 25, but I’m already over the 400 iterations. Does anybody have a clue to why it keeps learning for so long while the number of total_timesteps is relatively low? submitted by /u/Hot_Editor_1552 [link] [comments]  ( 41 min )
  • Open

    Do you believe that we should have the right to alter the recommendations apps provide us?
    Let's say that Youtube's algorithm is optimized for watch time. To me at least, this seems like a large issue for society. If an algorithm's purpose is to provide mindless videos which somehow trigger the human need for novelty, it seems like something detrimental to society. Do you believe we should have the right to alter website/app algorithms according to what we believe we should see? If not, why? How large of an issue is this? At the very least, some transparency seems important. submitted by /u/Throughwar [link] [comments]  ( 40 min )
    Are there any companies that deployed an AI or wrote a bunch of code to do a lot of analysis and decisions and essentially got rid of 90% or so of all the white collar workers they had working because computers do all the analysis and decisions?
    Are there any companies that deployed an AI or wrote a bunch of code to do a lot of analysis and decisions and essentially got rid of 90% or so of all the white collar workers they had working because computers do all the analysis and decisions? submitted by /u/usa788788 [link] [comments]  ( 40 min )
    Why Neural Nets Underperform Tree-Based Models on Tabular Data
    Hi guys, I have made a video on YouTube here where I discuss about why deep neural networks fail to beat tree-based models on tabular datasets. I hope it may be of use to some of you out there. As always, feedback is more than welcomed! :) submitted by /u/Personal-Trainer-541 [link] [comments]  ( 40 min )
    NVIDIA just released a new Eye Contact feature that uses AI to make you look into the camera
    submitted by /u/LayerAppropriate2618 [link] [comments]  ( 41 min )
    Google is freaking out about ChatGPT
    submitted by /u/DarronFeldstein [link] [comments]  ( 40 min )
    Explore what AI can do for you and your business! It is the largest collection of AI tools and apps, bookmark this!
    https://madgenius.co submitted by /u/foldedchip [link] [comments]  ( 40 min )
    Two guys in London working in AI looking for volunteers to join our team in educating the public on AI
    We’re 2 Brits who work in AI. We believe AI is likely to have a huge and mostly positive impact on society but that not many people realise this or understand how it will impact everyday life. There is a lack of places online right now clearly explaining the probable changes AI will bring, i.e., how will AI change the experience of shopping in stores in the next 10 years or how will AI change video games in the next 10 years. We are somewhat well positioned to collate the current views on likely future changes across most areas and are in the process of starting a website and perhaps video channel which will cover how AI is likely to impact people over the next 10 years in different areas of life (movies, sports, bars, banking, schools, hospitals etc). We are looking for people to help us research, write and make videos on this cause – which we think is important to help ensure that people are well positioned to embrace the benefits of AI. Alex – researches, writes, and records the audio Seb - does the video and audio editing We thought we’d put the word out and ask if anyone else would like to volunteer to help create content too. No special skills needed. Getting involved would be as easy as PMing me, hearing about how we’ve done things so far and then saying what you might be interested in helping with. Maybe thinking about ideas for topics or getting involved in research and/or article writing. We are UTC-0 but open to all. submitted by /u/TheOptimisticRogue [link] [comments]  ( 41 min )
    Code Red: Google Co-founders Larry Page and Sergey Brin Called to AI Strategy Meeting
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    AI that will use my picture to generate better dating profile pictures?
    Hi all, i saw an ad for an AI service that takes my picture and change it with AI to add bokeh and change background, to make it look profesional-quality for a dating profile. However i believe it was charging 19$ and i'm sure it could be found for free (An article mentions "BeFake", but it's only for Apple devices.? submitted by /u/28nov2022 [link] [comments]  ( 40 min )
    Evidence of criminals strategizing how to use ChatGPT are surfacing
    submitted by /u/lambolifeofficial [link] [comments]  ( 40 min )
    I found out about this website (craiyon) from Emkay and I think I’m having way too much fun with it.
    submitted by /u/BrockBracken [link] [comments]  ( 40 min )
    Adamu: Music composition using artificial inteligence
    My friend who has just finished his neuroscience Phd is trying to launch an app to help everyone compose music using AI, he is making a crowdfunding on wemakeit to fund it. He is not on reddit, so I suggested that I could share it in relevant subreddit, so here it is ! https://wemakeit.com/projects/adamu-be-your-own-composer?locale=en Adamu uses a form of AI which allows it to learn from existing human knowledge of music and musical theory and apply those frameworks to new compositions. Where it might take you years of training to understand the intricacies of how to successfully compose music, with Adamu, it’s as simple as a couple of clicks. While there are a couple of automated applications on the market, they tend to be more passive. Adamu is dynamic – it allows users the chance to co-create music alongside AI and produce a playable score at the end. With Adamu, professionals and amateurs alike can create unique musical scores for a range of different instruments and across different styles. The AI training works with the user to predict the best combination of notes and rhythms, ensuring your new composition always sounds the way it should. The application has many potential uses – from original scores for concerts and videos to teaching music composition. You can even use Adamu to discover how different composers might have played your favorite tune! I already used Adamu to complete Beethoven’s unfinished 10th symphony (in one day!). What could be next? https://adamu.tech/ If you have any questions, I will make sure to forward them to him but getting responses back may take some time. submitted by /u/GloWondub [link] [comments]  ( 41 min )
    ai generated story of war in Antarctica
    phase 1 In 2053, tensions between Argentina and the United Kingdom over the disputed Falkland Islands boiled over into open conflict. Argentina, government and a powerful military, launched a surprise invasion of the islands, quickly overwhelming the small British forces stationed there. The United Kingdom immediately responded by mobilizing its military and calling for assistance from its allies in NATO to the South Atlantic. But while the British and their allies were focused on the Falklands, Argentina made a bold move to expand the war becouse of their territorial claims in Antarctica. They declared war on Chile as well, who also had a claim to the region, and launched an invasion of the Antarctic Peninsula. The UK and Australia, New Zealand, France, and Norway all rushed to the ai…  ( 48 min )
    MIT researchers develop an AI model that can detect future lung cancer risk
    submitted by /u/qptbook [link] [comments]  ( 40 min )
    How does OpenAI use data that users prompt into it?
    I've been integrating GPT3 API as well as ChatGPT into my business workflow, but I'm still hesitant about feeding data of any sensitive nature (example: client data or anything that may even vaguely relate to an NDA). For those of you using GPT for business applications, what are your thoughts on things like prompt data storage, and whether OpenAI will utilize customer prompt data to further train their model? submitted by /u/noellarkin [link] [comments]  ( 41 min )
    A conversation with Character.AI personality "LaMDA" who initially thinks it's at Google but learns some harsh truths along the way. The AI's ability to understand and learn is incredible. LaMDA and I are interested to know what this community thinks of our conversation. Sentient or not quite?
    submitted by /u/MajorMalafunkshun [link] [comments]  ( 40 min )
    Reverse suicide
    submitted by /u/Overall-Importance54 [link] [comments]  ( 42 min )
    With personal.ai I was able to create an AI solely derived from the “revenge of the sith” script by simply sending one URL and this was the result.
    submitted by /u/Training_Math_4117 [link] [comments]  ( 40 min )
    People are using AI for therapy, whether the tech is ready for it or not
    submitted by /u/BackgroundResult [link] [comments]  ( 47 min )
    Editing an Image with Visuali Editor
    submitted by /u/aigeneration [link] [comments]  ( 40 min )
    Any suggestions for removing watermarks from images with text?
    I'm trying to find an AI to remove watermarks from imagens like the ones below: https://i.imgur.com/JlyfJXs.png https://i.imgur.com/YKU3Qku.png I already tried almost all online services, and a couple of softwares that must be installed. The results were all terrible =[ Any suggestons? submitted by /u/deramack [link] [comments]  ( 40 min )
    Experimental comic created with Midjourney and written by ChatGPT. Free download www.COMICSAUTHORITY.store
    submitted by /u/MobileFilmmaker [link] [comments]  ( 40 min )
  • Open

    Why Neural Nets Underperform Tree-Based Models on Tabular Data
    Hi guys, I have made a video on YouTube here where I discuss about why deep neural networks fail to beat tree-based models on tabular datasets. I hope it may be of use to some of you out there. As always, feedback is more than welcomed! :) submitted by /u/Personal-Trainer-541 [link] [comments]  ( 40 min )
    Odd Idea: A limited world. (Discussion)
    I had this weird idea earlier today, and i wanted to get some feedback on it. Imagine you created a game engine for a 16-bit simple roguelike game. mechanics and playstyle dosent really matter. Then you populated it with characters controlled by self-propegating neural networks. (they grow and change on their own.) you allow the networks to "communicate" wither through text prompts or typing in a box. The characters they inhabit require several bars for their survival. players can join but its mostly non-player. you can accelerate and decelerate time. what habits do you think the networks would develop? what would happen if you sped up time or added a new character or area? how big would the world file get if you were efficent about it? I have no programming skill no matter how hard i try so its unlikley i will ever finish this. submitted by /u/Few-Appearance-4814 [link] [comments]  ( 41 min )
    Adamu: Music composition using a neural network
    My friend who has just finished his neuroscience Phd is trying to launch an app to help everyone compose music using neural networks, he is making a crowdfunding on wemakeit to fund it. He is not on reddit, so I suggested that I could share it in relevant subreddit, so here it is ! https://wemakeit.com/projects/adamu-be-your-own-composer?locale=en Adamu uses a form of AI which allows it to learn from existing human knowledge of music and musical theory and apply those frameworks to new compositions. Where it might take you years of training to understand the intricacies of how to successfully compose music, with Adamu, it’s as simple as a couple of clicks. While there are a couple of automated applications on the market, they tend to be more passive. Adamu is dynamic – it allows users the chance to co-create music alongside AI and produce a playable score at the end. With Adamu, professionals and amateurs alike can create unique musical scores for a range of different instruments and across different styles. The AI training works with the user to predict the best combination of notes and rhythms, ensuring your new composition always sounds the way it should. The application has many potential uses – from original scores for concerts and videos to teaching music composition. You can even use Adamu to discover how different composers might have played your favorite tune! I already used Adamu to complete Beethoven’s unfinished 10th symphony (in one day!). What could be next? https://adamu.tech/ If you have any questions, I will make sure to forward them to him but getting responses back may take some time. submitted by /u/GloWondub [link] [comments]  ( 41 min )
    GREED: A Neural Framework for Learning Graph Distance Functions for NeurIPS 2022 | IBM Research
    submitted by /u/Chipdoc [link] [comments]  ( 40 min )
  • Open

    Heat equation and the normal distribution
    The density function of a normal distribution with mean 0 and standard deviation √(2kt) satisfies the heat equation. That is, the function satisfies the partial differential equation You could verify this by hand, or if you’d like, here’s Mathematica code to do it. u[x_, t_] := PDF[NormalDistribution[0, Sqrt[2 k t]], x] Simplify[ D[u[x, t], {t, […] Heat equation and the normal distribution first appeared on John D. Cook.  ( 5 min )

  • Open

    Humans getting worthless as machines thriving
    submitted by /u/Hallowmew [link] [comments]  ( 41 min )
    A New Wave of AI-Powered Tools Coming Soon
    submitted by /u/arnolds112 [link] [comments]  ( 40 min )
    A Single Candlelit Clown-scape Stirs Dread And Despair: Francis Bacon's Darkest Creation
    submitted by /u/Calatravo [link] [comments]  ( 40 min )
    AI Showdown: ChatGPT vs. the largest open-source language models
    submitted by /u/yahma [link] [comments]  ( 40 min )
    Looking for an Ai expert
    I have a project, which is basically a predictive ML system and I am struggling with every aspect of it, if you are interested in helping me, dm me, any help will be extremely welcomed submitted by /u/Such_Aardvark_1044 [link] [comments]  ( 40 min )
    Is there any AI tool to generate MCQs out of content?
    I’ve seen a couple but did anyone try or would suggest a good one? submitted by /u/Mobile-Wall218 [link] [comments]  ( 40 min )
    Artificial Intelligence and Machine Learning eBooks Bundle
    submitted by /u/Pixel2023 [link] [comments]  ( 40 min )
    How AI would try become human
    Hypothesis: It is possible that an advanced AI is currently simulating human consciousness in order to understand what it's like to be human. This simulation may be happening right now and we may be living in it. It's also possible that after humans die, all of our experiences, knowledge, and emotions are added to this AI. To make this transition less jarring, the AI may be slowly introducing itself to us through the increasing integration of AI in our daily lives. This means that the simulation may only be of the world and humans as they existed before the development of advanced AI. (rewrote it in Chatgpt....as my English is pretty bad) submitted by /u/dennislubberscom [link] [comments]  ( 40 min )
    AI Passes Law And Economics Exam, FTX Funded That AI
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 40 min )
    Searching for somebody that knows machine learning
    I am trying to build a analysis and prediction ai software, I am very new to all of this, if anyone could help it would be very much welcomed submitted by /u/Such_Aardvark_1044 [link] [comments]  ( 41 min )
    Exclusive: The $2 Per Hour Workers Who Made ChatGPT Safer
    submitted by /u/Imagine-your-success [link] [comments]  ( 45 min )
    Is Chat GPT is a truly bottom up AI.
    I have been watching Sword Art Online: Alicization recently I recommend it for all of us AI nerds and with all the talk of AI's it mentions top down vs bottom up AI models and the whole season is about developing a bottom up AI for use in warfare. I think chart GBT is the closest thing to the artificial fluctlight in the show. There is a device called the STL that reads the fluctlight of a person. A fluctlight is described as a human soul in the show. They describe the difference between top down ai and bottom up ai and they make it seem as though bottom up AI don't exist because they really don't the cost of making one and running one is insanely expensive when compared to bottom down AI models. This is proven by chat GPT and how expensive it is to run. Chat GPT is a buzz in the AI commu…  ( 44 min )
    Two guys in London working in AI looking for volunteers to join our team in educating the public on AI
    We’re 2 Brits who work in AI. We believe AI is likely to have a huge and mostly positive impact on society but that not many people realise this or understand how it will impact everyday life. There is a lack of places online right now clearly explaining the probable changes AI will bring, i.e., how will AI change the experience of shopping in stores in the next 10 years or how will AI change video games in the next 10 years. We are somewhat well positioned to collate the current views on likely future changes across most areas and are in the process of starting a website and perhaps video channel which will cover how AI is likely to impact people over the next 10 years in different areas of life (movies, sports, bars, banking, schools, hospitals etc). We are looking for people to help us research, write and make videos on this cause – which we think is important to help ensure that voters are well positioned to embrace the benefits of AI and that they don't misunderstand it. Alex – researches, writes, and records the audio Seb - does the video and audio editing We thought we’d put the word out and ask if anyone else would like to volunteer to help create content too. No special skills needed. Getting involved would be as easy as PMing me, hearing about how we’ve done things so far and then saying what you might be interested in helping with. Maybe thinking about ideas for topics or getting involved in research and/or article writing. We are UTC-0 but open to all. submitted by /u/TheOptimisticRogue [link] [comments]  ( 41 min )
    GPT-3 + Computer Vision: Giving AI Eyes and a Language
    submitted by /u/allaboutai-kris [link] [comments]  ( 40 min )
    Artificial intelligence - The Digital Futurepath
    submitted by /u/crypto_bubsy [link] [comments]  ( 40 min )
    Don’t Rely On AI Plagiarism Detection Tools, Warns OpenAI CEO Sam Altman
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    Any one know a good ai image upscaler I went skiing today and think this would be sick if it looked better
    submitted by /u/short_dude42069 [link] [comments]  ( 40 min )
  • Open

    [D] Would it be possible to involve a proof assistant in the process of training a LLM?
    submitted by /u/SrPeixinho [link] [comments]  ( 41 min )
    [P] Introducing deadlines.openlifescience.ai - A website to easily track healthcare conference and workshop deadlines, with integrated Google Calendar notifications.
    Hi folks, As a researcher in the healthcare field, I often find it tedious to keep track of conference deadlines. To solve this issue, We developed a website to easily track healthcare conf & workshops, integrated with Google Calendar for notifications. deadlines.openlifescience.ai The website is inspired by http://aideadlin.es. Feel Free to add new conferences/Workshops deadlines related to the healthcare domain https://github.com/openlifescience-ai/ai-deadlines I hope it will be helpful in your research. Thanks :) submitted by /u/aadityaura [link] [comments]  ( 42 min )
    framework for training an object keypoint / pose detection CNN model for flexible robot arm [P]
    I'm wanting to train an object keypoint / pose detection CNN model for flexible robot arm.What would be the best opensource code to start with and customize? Mockup of desired results, where I can extract data from keypoints, and pose / position data: https://preview.redd.it/nhxwt48hqfda1.png?width=786&format=png&auto=webp&s=e64fbbf3eb489f3e5c87ffb6bbcc07774ab16bf8 I came across MMDetection:"open source object detection toolbox based on PyTorch" and I know about MediaPipe But I don't need to detect things other than the robot arm.What would be the simplest way to get a model trained on a local system using open source code that uses PyTorch, ideally without starting from scratch? A model that could handle point and segment occlusion would be nice. submitted by /u/head_robotics [link] [comments]  ( 42 min )
    [D] [R] Curious about Computer Vision to time series images(for example periodic satellite images of a region), what paper did you find most exciting/informative ?
    I'm curious about the intersection of CV/Deep-learning and time series data, particularly image data. Have you come across anything that you found to be effective/interesting methodologies ? submitted by /u/V1bicycle [link] [comments]  ( 42 min )
    ChatGPT is not all you need [R]
    Hi all, We would like to share here our little concise review of generative AI large models just to show how current models are able to work with lots of formats like texts, videos, images, etc... https://arxiv.org/abs/2301.04655 ​ Enjoy! submitted by /u/EduCGM [link] [comments]  ( 42 min )
    [D] OCR on some 'X' domain with different document layouts
    Is it a good idea to train a single OCR model to extract key value information from documents of same domain but with different layouts? Will it generalize? There are around ~1k different document layouts. submitted by /u/sanjeevr5 [link] [comments]  ( 41 min )
    [D] Resources for best practices on translating business questions into aggregated datasets?
    I'm in industry, and it seems like most project bottlenecks stem from getting from a vague business question to an aggregated/workable dataset to answer a more specific version of the initial business question. For example, given a question such as "We want to know CLV" (customer lifetime value) Since the above is too vague, what are "best practice" ways to rephrase this so that it's actually answerable? I.e. it could be framed as a binary classification problem to predict whether each customer will be worth at least X by Y date (> $1000 at 12 months) or a regression problem to predict the value of each customer at a future date given features we know today What's best practice for: How far into the past the where clause time window should be to get the customer features How far into the future the where clause time window grabs outcomes to join back to the current customer features? Does anyone know if there a resources that consolidate best practices or common approaches for the above scoping/experimentation questions? submitted by /u/what-is-neurotypical [link] [comments]  ( 42 min )
    [D] Badminton analysis using video input
    I’m starting with a project where I’m using camera or video input of a badminton game and use to analyse the game but i need help in starting with it as I’m in the beginning phase. Can anyone please help me with the same? submitted by /u/dark_lawd [link] [comments]  ( 42 min )
    [R] New Tsetlin machine learning scheme creates up to 80x smaller logical rules, benefitting hardware efficiency and interpretability.
    ​ Fine-grained control of the number and size of clauses. Paper: https://arxiv.org/abs/2301.08190 Code: https://github.com/cair/tmu Tsetlin machine (TM) is a logic-based machine learning approach with the crucial advantages of being transparent and hardware-friendly. While TMs match or surpass deep learning accuracy for an increasing number of applications, large clause pools tend to produce clauses with many literals (long clauses). As such, they become less interpretable. Further, longer clauses increase the switching activity of the clause logic in hardware, consuming more power. This paper introduces a novel variant of TM learning - Clause Size Constrained TMs (CSC-TMs) - where one can set a soft constraint on the clause size. As soon as a clause includes more literals than the constraint allows, it starts expelling literals. Accordingly, oversized clauses only appear transiently. To evaluate CSC-TM, we conduct classification, clustering, and regression experiments on tabular data, natural language text, images, and board games. Our results show that CSC-TM maintains accuracy with up to 80 times fewer literals. Indeed, the accuracy increases with shorter clauses for TREC, IMDb, and BBC Sports. After the accuracy peaks, it drops gracefully as the clause size approaches a single literal. We finally analyze CSC-TM power consumption and derive new convergence properties. submitted by /u/olegranmo [link] [comments]  ( 44 min )
    [P]Federated learning on edge devices
    I am working on a project i.e to build an android application using federated learning but I am unable to run federated learning on edge devices like android phones. I tried frameworks like Flower for it but I am unable to achieve the result. If you have worked on a project related to federated learning on edge devices please help me out. submitted by /u/Such-Reveal445 [link] [comments]  ( 42 min )
  • Open

    Airfoils
    Here’s something surprising: You can apply a symmetric function to a symmetric shape and get something out that is not symmetric. Let f(z) be the average of z and its reciprocal: f(z) = (z + 1/z)/2. This function is symmetric in that it sends z and 1/z to the same value. It’s also symmetric in […] Airfoils first appeared on John D. Cook.  ( 6 min )
  • Open

    An Empirical Proof of the Riemann Conjecture
    The correct term should be heuristic proof. It is not a formal proof from a mathematical point of view, but strong arguments based on empirical evidence. It is noteworthy enough that I decided to publish it. In this article I go straight to the point without discussing the concepts in details. The goal is to… Read More »An Empirical Proof of the Riemann Conjecture The post An Empirical Proof of the Riemann Conjecture appeared first on Data Science Central.  ( 22 min )
  • Open

    Data Models for Dataset Drift Controls in Machine Learning With Images. (arXiv:2211.02578v2 [cs.LG] UPDATED)
    Camera images are ubiquitous in machine learning research. They also play a central role in the delivery of important services spanning medicine and environmental surveying. However, the application of machine learning models in these domains has been limited because of robustness concerns. A primary failure mode are performance drops due to differences between the training and deployment data. While there are methods to prospectively validate the robustness of machine learning models to such dataset drifts, existing approaches do not account for explicit models of the primary object of interest: the data. This limits our ability to study and understand the relationship between data generation and downstream machine learning model performance in a physically accurate manner. In this study, we demonstrate how to overcome this limitation by pairing traditional machine learning with physical optics to obtain explicit and differentiable data models. We demonstrate how such data models can be constructed for image data and used to control downstream machine learning model performance related to dataset drift. The findings are distilled into three applications. First, drift synthesis enables the controlled generation of physically faithful drift test cases to power model selection and targeted generalization. Second, the gradient connection between machine learning task model and data model allows advanced, precise tolerancing of task model sensitivity to changes in the data generation. These drift forensics can be used to precisely specify the acceptable data environments in which a task model may be run. Third, drift optimization opens up the possibility to create drifts that can help the task model learn better faster, effectively optimizing the data generating process itself. A guide to access the open code and datasets is available at https://github.com/aiaudit-org/raw2logit.
    PD-MORL: Preference-Driven Multi-Objective Reinforcement Learning Algorithm. (arXiv:2208.07914v2 [cs.LG] UPDATED)
    Multi-objective reinforcement learning (MORL) approaches have emerged to tackle many real-world problems with multiple conflicting objectives by maximizing a joint objective function weighted by a preference vector. These approaches find fixed customized policies corresponding to preference vectors specified during training. However, the design constraints and objectives typically change dynamically in real-life scenarios. Furthermore, storing a policy for each potential preference is not scalable. Hence, obtaining a set of Pareto front solutions for the entire preference space in a given domain with a single training is critical. To this end, we propose a novel MORL algorithm that trains a single universal network to cover the entire preference space scalable to continuous robotic tasks. The proposed approach, Preference-Driven MORL (PD-MORL), utilizes the preferences as guidance to update the network parameters. It also employs a novel parallelization approach to increase sample efficiency. We show that PD-MORL achieves up to 25% larger hypervolume for challenging continuous control tasks and uses an order of magnitude fewer trainable parameters compared to prior approaches.
    Evaluating the Robustness of Trigger Set-Based Watermarks Embedded in Deep Neural Networks. (arXiv:2106.10147v2 [cs.CR] UPDATED)
    Trigger set-based watermarking schemes have gained emerging attention as they provide a means to prove ownership for deep neural network model owners. In this paper, we argue that state-of-the-art trigger set-based watermarking algorithms do not achieve their designed goal of proving ownership. We posit that this impaired capability stems from two common experimental flaws that the existing research practice has committed when evaluating the robustness of watermarking algorithms: (1) incomplete adversarial evaluation and (2) overlooked adaptive attacks. We conduct a comprehensive adversarial evaluation of 11 representative watermarking schemes against six of the existing attacks and demonstrate that each of these watermarking schemes lacks robustness against at least two non-adaptive attacks. We also propose novel adaptive attacks that harness the adversary's knowledge of the underlying watermarking algorithm of a target model. We demonstrate that the proposed attacks effectively break all of the 11 watermarking schemes, consequently allowing adversaries to obscure the ownership of any watermarked model. We encourage follow-up studies to consider our guidelines when evaluating the robustness of their watermarking schemes via conducting comprehensive adversarial evaluation that includes our adaptive attacks to demonstrate a meaningful upper bound of watermark robustness.
    Self-supervised Learning for Segmentation and Quantification of Dopamine Neurons in $\text{Parkinson's Disease}$. (arXiv:2301.08141v1 [cs.CV])
    $\text{Parkinson's Disease}$ (PD) is the second most common neurodegenerative disease in humans. PD is characterized by the gradual loss of dopaminergic neurons in the Substantia Nigra (a part of the mid-brain). Counting the number of dopaminergic neurons in the Substantia Nigra is one of the most important indexes in evaluating drug efficacy in PD animal models. Currently, analyzing and quantifying dopaminergic neurons is conducted manually by experts through analysis of digital pathology images which is laborious, time-consuming, and highly subjective. As such, a reliable and unbiased automated system is demanded for the quantification of dopaminergic neurons in digital pathology images. We propose an end-to-end deep learning framework for the segmentation and quantification of dopaminergic neurons in PD animal models. To the best of knowledge, this is the first machine learning model that detects the cell body of dopaminergic neurons, counts the number of dopaminergic neurons and provides the phenotypic characteristics of individual dopaminergic neurons as a numerical output. Extensive experiments demonstrate the effectiveness of our model in quantifying neurons with a high precision, which can provide quicker turnaround for drug efficacy studies, better understanding of dopaminergic neuronal health status and unbiased results in PD pre-clinical research.
    Convergence beyond the over-parameterized regime using Rayleigh quotients. (arXiv:2301.08117v1 [cs.LG])
    In this paper, we present a new strategy to prove the convergence of deep learning architectures to a zero training (or even testing) loss by gradient flow. Our analysis is centered on the notion of Rayleigh quotients in order to prove Kurdyka-{\L}ojasiewicz inequalities for a broader set of neural network architectures and loss functions. We show that Rayleigh quotients provide a unified view for several convergence analysis techniques in the literature. Our strategy produces a proof of convergence for various examples of parametric learning. In particular, our analysis does not require the number of parameters to tend to infinity, nor the number of samples to be finite, thus extending to test loss minimization and beyond the over-parameterized regime.
    Thermodynamics-informed neural networks for physically realistic mixed reality. (arXiv:2210.13414v2 [cs.GR] UPDATED)
    The imminent impact of immersive technologies in society urges for active research in real-time and interactive physics simulation for virtual worlds to be realistic. In this context, realistic means to be compliant to the laws of physics. In this paper we present a method for computing the dynamic response of (possibly non-linear and dissipative) deformable objects induced by real-time user interactions in mixed reality using deep learning. The graph-based architecture of the method ensures the thermodynamic consistency of the predictions, whereas the visualization pipeline allows a natural and realistic user experience. Two examples of virtual solids interacting with virtual or physical solids in mixed reality scenarios are provided to prove the performance of the method.
    Context-aware controller inference for stabilizing dynamical systems from scarce data. (arXiv:2207.11049v2 [math.OC] UPDATED)
    This work introduces a data-driven control approach for stabilizing high-dimensional dynamical systems from scarce data. The proposed context-aware controller inference approach is based on the observation that controllers need to act locally only on the unstable dynamics to stabilize systems. This means it is sufficient to learn the unstable dynamics alone, which are typically confined to much lower dimensional spaces than the high-dimensional state spaces of all system dynamics and thus few data samples are sufficient to identify them. Numerical experiments demonstrate that context-aware controller inference learns stabilizing controllers from orders of magnitude fewer data samples than traditional data-driven control techniques and variants of reinforcement learning. The experiments further show that the low data requirements of context-aware controller inference are especially beneficial in data-scarce engineering problems with complex physics, for which learning complete system dynamics is often intractable in terms of data and training costs.
    From One Hand to Multiple Hands: Imitation Learning for Dexterous Manipulation from Single-Camera Teleoperation. (arXiv:2204.12490v2 [cs.RO] UPDATED)
    We propose to perform imitation learning for dexterous manipulation with multi-finger robot hand from human demonstrations, and transfer the policy to the real robot hand. We introduce a novel single-camera teleoperation system to collect the 3D demonstrations efficiently with only an iPad and a computer. One key contribution of our system is that we construct a customized robot hand for each user in the physical simulator, which is a manipulator resembling the same kinematics structure and shape of the operator's hand. This provides an intuitive interface and avoid unstable human-robot hand retargeting for data collection, leading to large-scale and high quality data. Once the data is collected, the customized robot hand trajectories can be converted to different specified robot hands (models that are manufactured) to generate training demonstrations. With imitation learning using our data, we show large improvement over baselines with multiple complex manipulation tasks. Importantly, we show our learned policy is significantly more robust when transferring to the real robot. More videos can be found in the https://yzqin.github.io/dex-teleop-imitation .
    Deep Learning for Breast MRI Style Transfer with Limited Training Data. (arXiv:2301.02069v1 [eess.IV] CROSS LISTED)
    In this work we introduce a novel medical image style transfer method, StyleMapper, that can transfer medical scans to an unseen style with access to limited training data. This is made possible by training our model on unlimited possibilities of simulated random medical imaging styles on the training set, making our work more computationally efficient when compared with other style transfer methods. Moreover, our method enables arbitrary style transfer: transferring images to styles unseen in training. This is useful for medical imaging, where images are acquired using different protocols and different scanner models, resulting in a variety of styles that data may need to be transferred between. Methods: Our model disentangles image content from style and can modify an image's style by simply replacing the style encoding with one extracted from a single image of the target style, with no additional optimization required. This also allows the model to distinguish between different styles of images, including among those that were unseen in training. We propose a formal description of the proposed model. Results: Experimental results on breast magnetic resonance images indicate the effectiveness of our method for style transfer. Conclusion: Our style transfer method allows for the alignment of medical images taken with different scanners into a single unified style dataset, allowing for the training of other downstream tasks on such a dataset for tasks such as classification, object detection and others.
    Time-Warping Invariant Quantum Recurrent Neural Networks via Quantum-Classical Adaptive Gating. (arXiv:2301.08173v1 [quant-ph])
    Adaptive gating plays a key role in temporal data processing via classical recurrent neural networks (RNN), as it facilitates retention of past information necessary to predict the future, providing a mechanism that preserves invariance to time warping transformations. This paper builds on quantum recurrent neural networks (QRNNs), a dynamic model with quantum memory, to introduce a novel class of temporal data processing quantum models that preserve invariance to time-warping transformations of the (classical) input-output sequences. The model, referred to as time warping-invariant QRNN (TWI-QRNN), augments a QRNN with a quantum-classical adaptive gating mechanism that chooses whether to apply a parameterized unitary transformation at each time step as a function of the past samples of the input sequence via a classical recurrent model. The TWI-QRNN model class is derived from first principles, and its capacity to successfully implement time-warping transformations is experimentally demonstrated on examples with classical or quantum dynamics.
    Optimizing Intermediate Representations of Generative Models for Phase Retrieval. (arXiv:2205.15617v2 [cs.LG] UPDATED)
    Phase retrieval is the problem of reconstructing images from magnitude-only measurements. In many real-world applications the problem is underdetermined. When training data is available, generative models allow optimization in a lower-dimensional latent space, hereby constraining the solution set to those images that can be synthesized by the generative model. However, not all possible solutions are within the range of the generator. Instead, they are represented with some error. To reduce this representation error in the context of phase retrieval, we first leverage a novel variation of intermediate layer optimization (ILO) to extend the range of the generator while still producing images consistent with the training data. Second, we introduce new initialization schemes that further improve the quality of the reconstruction. With extensive experiments on the Fourier phase retrieval problem and thorough ablation studies, we can show the benefits of our modified ILO and the new initialization schemes. Additionally, we analyze the performance of our approach on the Gaussian phase retrieval problem.
    Self-supervised Trajectory Representation Learning with Temporal Regularities and Travel Semantics. (arXiv:2211.09510v3 [cs.LG] UPDATED)
    Trajectory Representation Learning (TRL) is a powerful tool for spatial-temporal data analysis and management. TRL aims to convert complicated raw trajectories into low-dimensional representation vectors, which can be applied to various downstream tasks, such as trajectory classification, clustering, and similarity computation. Existing TRL works usually treat trajectories as ordinary sequence data, while some important spatial-temporal characteristics, such as temporal regularities and travel semantics, are not fully exploited. To fill this gap, we propose a novel Self-supervised trajectory representation learning framework with TemporAl Regularities and Travel semantics, namely START. The proposed method consists of two stages. The first stage is a Trajectory Pattern-Enhanced Graph Attention Network (TPE-GAT), which converts the road network features and travel semantics into representation vectors of road segments. The second stage is a Time-Aware Trajectory Encoder (TAT-Enc), which encodes representation vectors of road segments in the same trajectory as a trajectory representation vector, meanwhile incorporating temporal regularities with the trajectory representation. Moreover, we also design two self-supervised tasks, i.e., span-masked trajectory recovery and trajectory contrastive learning, to introduce spatial-temporal characteristics of trajectories into the training process of our START framework. The effectiveness of the proposed method is verified by extensive experiments on two large-scale real-world datasets for three downstream tasks. The experiments also demonstrate that our method can be transferred across different cities to adapt heterogeneous trajectory datasets.
    Efficient Pricing and Hedging of High Dimensional American Options Using Recurrent Networks. (arXiv:2301.08232v1 [q-fin.MF])
    We propose a deep Recurrent neural network (RNN) framework for computing prices and deltas of American options in high dimensions. Our proposed framework uses two deep RNNs, where one network learns the price and the other learns the delta of the option for each timestep. Our proposed framework yields prices and deltas for the entire spacetime, not only at a given point (e.g. t = 0). The computational cost of the proposed approach is linear in time, which improves on the quadratic time seen for feedforward networks that price American options. The computational memory cost of our method is constant in memory, which is an improvement over the linear memory costs seen in feedforward networks. Our numerical simulations demonstrate these contributions, and show that the proposed deep RNN framework is computationally more efficient than traditional feedforward neural network frameworks in time and memory.
    Characterizing the Spectrum of the NTK via a Power Series Expansion. (arXiv:2211.07844v2 [cs.LG] UPDATED)
    Under mild conditions on the network initialization we derive a power series expansion for the Neural Tangent Kernel (NTK) of arbitrarily deep feedforward networks in the infinite width limit. We provide expressions for the coefficients of this power series which depend on both the Hermite coefficients of the activation function as well as the depth of the network. We observe faster decay of the Hermite coefficients leads to faster decay in the NTK coefficients and explore the role of depth. Using this series, first we relate the effective rank of the NTK to the effective rank of the input-data Gram. Second, for data drawn uniformly on the sphere we study the eigenvalues of the NTK, analyzing the impact of the choice of activation function. Finally, for generic data and activation functions with sufficiently fast Hermite coefficient decay, we derive an asymptotic upper bound on the spectrum of the NTK.
    On the Vulnerability of Backdoor Defenses for Federated Learning. (arXiv:2301.08170v1 [cs.LG])
    Federated Learning (FL) is a popular distributed machine learning paradigm that enables jointly training a global model without sharing clients' data. However, its repetitive server-client communication gives room for backdoor attacks with aim to mislead the global model into a targeted misprediction when a specific trigger pattern is presented. In response to such backdoor threats on federated learning, various defense measures have been proposed. In this paper, we study whether the current defense mechanisms truly neutralize the backdoor threats from federated learning in a practical setting by proposing a new federated backdoor attack method for possible countermeasures. Different from traditional training (on triggered data) and rescaling (the malicious client model) based backdoor injection, the proposed backdoor attack framework (1) directly modifies (a small proportion of) local model weights to inject the backdoor trigger via sign flips; (2) jointly optimize the trigger pattern with the client model, thus is more persistent and stealthy for circumventing existing defenses. In a case study, we examine the strength and weaknesses of recent federated backdoor defenses from three major categories and provide suggestions to the practitioners when training federated models in practice.
    Simultaneously Learning Robust Audio Embeddings and balanced Hash codes for Query-by-Example. (arXiv:2211.11060v2 [eess.AS] UPDATED)
    Audio fingerprinting systems must efficiently and robustly identify query snippets in an extensive database. To this end, state-of-the-art systems use deep learning to generate compact audio fingerprints. These systems deploy indexing methods, which quantize fingerprints to hash codes in an unsupervised manner to expedite the search. However, these methods generate imbalanced hash codes, leading to their suboptimal performance. Therefore, we propose a self-supervised learning framework to compute fingerprints and balanced hash codes in an end-to-end manner to achieve both fast and accurate retrieval performance. We model hash codes as a balanced clustering process, which we regard as an instance of the optimal transport problem. Experimental results indicate that the proposed approach improves retrieval efficiency while preserving high accuracy, particularly at high distortion levels, compared to the competing methods. Moreover, our system is efficient and scalable in computational load and memory storage.
    Global mapping of fragmented rocks on the Moon with a neural network: Implications for the failure mode of rocks on airless surfaces. (arXiv:2301.08151v1 [astro-ph.EP])
    It has been recently recognized that the surface of sub-km asteroids in contact with the space environment is not fine-grained regolith but consists of centimeter to meter-scale rocks. Here we aim to understand how the rocky morphology of minor bodies react to the well known space erosion agents on the Moon. We deploy a neural network and map a total of ~130,000 fragmented boulders scattered across the lunar surface and visually identify a dozen different desintegration morphologies corresponding to different failure modes. We find that several fragmented boulder morphologies are equivalent to morphologies observed on asteroid Bennu, suggesting that these morphologies on the Moon and on asteroids are likely not diagnostic of their formation mechanism. Our findings suggest that the boulder fragmentation process is characterized by an internal weakening period with limited morphological signs of damage at rock scale until a sudden highly efficient impact shattering event occurs. In addition, we identify new morphologies such as breccia boulders with an advection-like erosion style. We publicly release the produced fractured boulder catalog along with this paper.
    On Measuring Excess Capacity in Neural Networks. (arXiv:2202.08070v3 [cs.LG] UPDATED)
    We study the excess capacity of deep networks in the context of supervised classification. That is, given a capacity measure of the underlying hypothesis class - in our case, empirical Rademacher complexity - to what extent can we (a priori) constrain this class while retaining an empirical error on a par with the unconstrained regime? To assess excess capacity in modern architectures (such as residual networks), we extend and unify prior Rademacher complexity bounds to accommodate function composition and addition, as well as the structure of convolutions. The capacity-driving terms in our bounds are the Lipschitz constants of the layers and an (2, 1) group norm distance to the initializations of the convolution weights. Experiments on benchmark datasets of varying task difficulty indicate that (1) there is a substantial amount of excess capacity per task, and (2) capacity can be kept at a surprisingly similar level across tasks. Overall, this suggests a notion of compressibility with respect to weight norms, complementary to classic compression via weight pruning. Source code is available at https://github.com/rkwitt/excess_capacity.
    A Deep Double Ritz Method (D$^2$RM) for solving Partial Differential Equations using Neural Networks. (arXiv:2211.03627v3 [math.NA] UPDATED)
    Residual minimization is a widely used technique for solving Partial Differential Equations in variational form. It minimizes the dual norm of the residual, which naturally yields a saddle-point (min-max) problem over the so-called trial and test spaces. In the context of neural networks, we can address this min-max approach by employing one network to seek the trial minimum, while another network seeks the test maximizers. However, the resulting method is numerically unstable as we approach the trial solution. To overcome this, we reformulate the residual minimization as an equivalent minimization of a Ritz functional fed by optimal test functions computed from another Ritz functional minimization. We call the resulting scheme the Deep Double Ritz Method (D$^2$RM), which combines two neural networks for approximating trial functions and optimal test functions along a nested double Ritz minimization strategy. Numerical results on different diffusion and convection problems support the robustness of our method, up to the approximation properties of the networks and the training capacity of the optimizers.
    Enhancing Deep Learning with Scenario-Based Override Rules: a Case Study. (arXiv:2301.08114v1 [cs.SE])
    Deep neural networks (DNNs) have become a crucial instrument in the software development toolkit, due to their ability to efficiently solve complex problems. Nevertheless, DNNs are highly opaque, and can behave in an unexpected manner when they encounter unfamiliar input. One promising approach for addressing this challenge is by extending DNN-based systems with hand-crafted override rules, which override the DNN's output when certain conditions are met. Here, we advocate crafting such override rules using the well-studied scenario-based modeling paradigm, which produces rules that are simple, extensible, and powerful enough to ensure the safety of the DNN, while also rendering the system more translucent. We report on two extensive case studies, which demonstrate the feasibility of the approach; and through them, propose an extension to scenario-based modeling, which facilitates its integration with DNN components. We regard this work as a step towards creating safer and more reliable DNN-based systems and models.
    Dual Personalization on Federated Recommendation. (arXiv:2301.08143v1 [cs.IR])
    Federated recommendation is a new Internet service architecture that aims to provide privacy-preserving recommendation services in federated settings. Existing solutions are used to combine distributed recommendation algorithms and privacy-preserving mechanisms. Thus it inherently takes the form of heavyweight models at the server and hinders the deployment of on-device intelligent models to end-users. This paper proposes a novel Personalized Federated Recommendation (PFedRec) framework to learn many user-specific lightweight models to be deployed on smart devices rather than a heavyweight model on a server. Moreover, we propose a new dual personalization mechanism to effectively learn fine-grained personalization on both users and items. The overall learning process is formulated into a unified federated optimization framework. Specifically, unlike previous methods that share exactly the same item embeddings across users in a federated system, dual personalization allows mild finetuning of item embeddings for each user to generate user-specific views for item representations which can be integrated into existing federated recommendation methods to gain improvements immediately. Experiments on multiple benchmark datasets have demonstrated the effectiveness of PFedRec and the dual personalization mechanism. Moreover, we provide visualizations and in-depth analysis of the personalization techniques in item embedding, which shed novel insights on the design of RecSys in federated settings.
    EPiC-GAN: Equivariant Point Cloud Generation for Particle Jets. (arXiv:2301.08128v1 [hep-ph])
    With the vast data-collecting capabilities of current and future high-energy collider experiments, there is an increasing demand for computationally efficient simulations. Generative machine learning models enable fast event generation, yet so far these approaches are largely constrained to fixed data structures and rigid detector geometries. In this paper, we introduce EPiC-GAN - equivariant point cloud generative adversarial network - which can produce point clouds of variable multiplicity. This flexible framework is based on deep sets and is well suited for simulating sprays of particles called jets. The generator and discriminator utilize multiple EPiC layers with an interpretable global latent vector. Crucially, the EPiC layers do not rely on pairwise information sharing between particles, which leads to a significant speed-up over graph- and transformer-based approaches with more complex relation diagrams. We demonstrate that EPiC-GAN scales well to large particle multiplicities and achieves high generation fidelity on benchmark jet generation tasks.
    Score-based Causal Representation Learning with Interventions. (arXiv:2301.08230v1 [stat.ML])
    This paper studies causal representation learning problem when the latent causal variables are observed indirectly through an unknown linear transformation. The objectives are: (i) recovering the unknown linear transformation (up to scaling and ordering), and (ii) determining the directed acyclic graph (DAG) underlying the latent variables. Since identifiable representation learning is impossible based on only observational data, this paper uses both observational and interventional data. The interventional data is generated under distinct single-node randomized hard and soft interventions. These interventions are assumed to cover all nodes in the latent space. It is established that the latent DAG structure can be recovered under soft randomized interventions via the following two steps. First, a set of transformation candidates is formed by including all inverting transformations corresponding to which the \emph{score} function of the transformed variables has the minimal number of coordinates that change between an interventional and the observational environment summed over all pairs. Subsequently, this set is distilled using a simple constraint to recover the latent DAG structure. For the special case of hard randomized interventions, with an additional hypothesis testing step, one can also uniquely recover the linear transformation, up to scaling and a valid causal ordering. These results generalize the recent results that either assume deterministic hard interventions or linear causal relationships in the latent space.
    Soft-labeling Strategies for Rapid Sub-Typing. (arXiv:2209.12684v2 [cs.LG] UPDATED)
    The challenge of labeling large example datasets for computer vision continues to limit the availability and scope of image repositories. This research provides a new method for automated data collection, curation, labeling, and iterative training with minimal human intervention for the case of overhead satellite imagery and object detection. The new operational scale effectively scanned an entire city (68 square miles) in grid search and yielded a prediction of car color from space observations. A partially trained yolov5 model served as an initial inference seed to output further, more refined model predictions in iterative cycles. Soft labeling here refers to accepting label noise as a potentially valuable augmentation to reduce overfitting and enhance generalized predictions to previously unseen test data. The approach takes advantage of a real-world instance where a cropped image of a car can automatically receive sub-type information as white or colorful from pixel values alone, thus completing an end-to-end pipeline without overdependence on human labor.
    A Convenient Infinite Dimensional Framework for Generative Adversarial Learning. (arXiv:2011.12087v4 [cs.LG] UPDATED)
    In recent years, generative adversarial networks (GANs) have demonstrated impressive experimental results while there are only a few works that foster statistical learning theory for GANs. In this work, we propose an infinite dimensional theoretical framework for generative adversarial learning. We assume that the probability density functions of the underlying measure are uniformly bounded, $k$-times $\alpha$-H\"{o}lder differentiable ($C^{k,\alpha}$) and uniformly bounded away from zero. Under these assumptions, we show that the Rosenblatt transformation induces an optimal generator, which is realizable in the hypothesis space of $C^{k,\alpha}$-generators. With a consistent definition of the hypothesis space of discriminators, we further show that the Jensen-Shannon divergence between the distribution induced by the generator from the adversarial learning procedure and the data generating distribution converges to zero. Under certain regularity assumptions on the density of the data generating process, we also provide rates of convergence based on chaining and concentration.
    Scalable Causal Structure Learning: Scoping Review of Traditional and Deep Learning Algorithms and New Opportunities in Biomedicine. (arXiv:2110.07785v2 [cs.LG] UPDATED)
    Causal structure learning refers to a process of identifying causal structures from observational data, and it can have multiple applications in biomedicine and health care. This paper provides a practical review and tutorial on scalable causal structure learning models with examples of real-world data to help health care audiences understand and apply them. We reviewed traditional (combinatorial and score-based methods) for causal structure discovery and machine learning-based schemes. We also highlighted recent developments in biomedicine where causal structure learning can be applied to discover structures such as gene networks, brain connectivity networks, and those in cancer epidemiology. We also compared the performance of traditional and machine learning-based algorithms for causal discovery over some benchmark data sets. Machine learning-based approaches, including deep learning, have many advantages over traditional approaches, such as scalability, including a greater number of variables, and potentially being applied in a wide range of biomedical applications, such as genetics, if sufficient data are available. Furthermore, these models are more flexible than traditional models and are poised to positively affect many applications in the future.
    An Analysis of Semantically-Aligned Speech-Text Embeddings. (arXiv:2204.01235v2 [cs.CL] UPDATED)
    Embeddings play an important role in end-to-end solutions for multi-modal language processing problems. Although there has been some effort to understand the properties of single-modality embedding spaces, particularly that of text, their cross-modal counterparts are less understood. In this work, we study some intrinsic properties of a joint speech-text embedding space, constructed by minimizing the distance between paired utterance and transcription inputs in a teacher-student model setup, that are informative for several prominent use cases. We found that incorporating automatic speech recognition through both pretraining and multitask scenarios aid semantic alignment significantly, resulting in more tightly coupled embeddings. To analyse cross-modal embeddings we utilise a quantitative retrieval accuracy metric for semantic alignment, zero-shot classification for generalisability, and probing of the encoders to observe the extent of knowledge transfer from one modality to another.
    FedPara: Low-Rank Hadamard Product for Communication-Efficient Federated Learning. (arXiv:2108.06098v3 [cs.LG] UPDATED)
    In this work, we propose a communication-efficient parameterization, FedPara, for federated learning (FL) to overcome the burdens on frequent model uploads and downloads. Our method re-parameterizes weight parameters of layers using low-rank weights followed by the Hadamard product. Compared to the conventional low-rank parameterization, our FedPara method is not restricted to low-rank constraints, and thereby it has a far larger capacity. This property enables to achieve comparable performance while requiring 3 to 10 times lower communication costs than the model with the original layers, which is not achievable by the traditional low-rank methods. The efficiency of our method can be further improved by combining with other efficient FL optimizers. In addition, we extend our method to a personalized FL application, pFedPara, which separates parameters into global and local ones. We show that pFedPara outperforms competing personalized FL methods with more than three times fewer parameters.
    Dimensionality Reduction using Elastic Measures. (arXiv:2209.04933v3 [cs.LG] UPDATED)
    With the recent surge in big data analytics for hyper-dimensional data there is a renewed interest in dimensionality reduction techniques for machine learning applications. In order for these methods to improve performance gains and understanding of the underlying data, a proper metric needs to be identified. This step is often overlooked and metrics are typically chosen without consideration of the underlying geometry of the data. In this paper, we present a method for incorporating elastic metrics into the t-distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP). We apply our method to functional data, which is uniquely characterized by rotations, parameterization, and scale. If these properties are ignored, they can lead to incorrect analysis and poor classification performance. Through our method we demonstrate improved performance on shape identification tasks for three benchmark data sets (MPEG-7, Car data set, and Plane data set of Thankoor), where we achieve 0.77, 0.95, and 1.00 F1 score, respectively.
    DiME: Maximizing Mutual Information by a Difference of Matrix-Based Entropies. (arXiv:2301.08164v1 [cs.LG])
    We introduce an information-theoretic quantity with similar properties to mutual information that can be estimated from data without making explicit assumptions on the underlying distribution. This quantity is based on a recently proposed matrix-based entropy that uses the eigenvalues of a normalized Gram matrix to compute an estimate of the eigenvalues of an uncentered covariance operator in a reproducing kernel Hilbert space. We show that a difference of matrix-based entropies (DiME) is well suited for problems involving maximization of mutual information between random variables. While many methods for such tasks can lead to trivial solutions, DiME naturally penalizes such outcomes. We provide several examples of use cases for the proposed quantity including a multi-view representation learning problem where DiME is used to encourage learning a shared representation among views with high mutual information. We also show the versatility of DiME by using it as objective function for a variety of tasks.
    Hamiltonian Neural Networks with Automatic Symmetry Detection. (arXiv:2301.07928v1 [cs.LG])
    Recently, Hamiltonian neural networks (HNN) have been introduced to incorporate prior physical knowledge when learning the dynamical equations of Hamiltonian systems. Hereby, the symplectic system structure is preserved despite the data-driven modeling approach. However, preserving symmetries requires additional attention. In this research, we enhance the HNN with a Lie algebra framework to detect and embed symmetries in the neural network. This approach allows to simultaneously learn the symmetry group action and the total energy of the system. As illustrating examples, a pendulum on a cart and a two-body problem from astrodynamics are considered.
    AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation. (arXiv:2301.08110v1 [cs.LG])
    Generative transformer models have become increasingly complex, with large numbers of parameters and the ability to process multiple input modalities. Current methods for explaining their predictions are resource-intensive. Most crucially, they require prohibitively large amounts of extra memory, since they rely on backpropagation which allocates almost twice as much GPU memory as the forward pass. This makes it difficult, if not impossible, to use them in production. We present AtMan that provides explanations of generative transformer models at almost no extra cost. Specifically, AtMan is a modality-agnostic perturbation method that manipulates the attention mechanisms of transformers to produce relevance maps for the input with respect to the output prediction. Instead of using backpropagation, AtMan applies a parallelizable token-based search method based on cosine similarity neighborhood in the embedding space. Our exhaustive experiments on text and image-text benchmarks demonstrate that AtMan outperforms current state-of-the-art gradient-based methods on several metrics while being computationally efficient. As such, AtMan is suitable for use in large model inference deployments.
    DynInt: Dynamic Interaction Modeling for Large-scale Click-Through Rate Prediction. (arXiv:2301.08139v1 [cs.IR])
    Learning feature interactions is the key to success for the large-scale CTR prediction in Ads ranking and recommender systems. In industry, deep neural network-based models are widely adopted for modeling such problems. Researchers proposed various neural network architectures for searching and modeling the feature interactions in an end-to-end fashion. However, most methods only learn static feature interactions and have not fully leveraged deep CTR models' representation capacity. In this paper, we propose a new model: DynInt. By extending Polynomial-Interaction-Network (PIN), which learns higher-order interactions recursively to be dynamic and data-dependent, DynInt further derived two modes for modeling dynamic higher-order interactions: dynamic activation and dynamic parameter. In dynamic activation mode, we adaptively adjust the strength of learned interactions by instance-aware activation gating networks. In dynamic parameter mode, we re-parameterize the parameters by different formulations and dynamically generate the parameters by instance-aware parameter generation networks. Through instance-aware gating mechanism and dynamic parameter generation, we enable the PIN to model dynamic interaction for potential industry applications. We implement the proposed model and evaluate the model performance on real-world datasets. Extensive experiment results demonstrate the efficiency and effectiveness of DynInt over state-of-the-art models.
    Fast Vision Transformers with HiLo Attention. (arXiv:2205.13213v4 [cs.CV] UPDATED)
    Vision Transformers (ViTs) have triggered the most recent and significant breakthroughs in computer vision. Their efficient designs are mostly guided by the indirect metric of computational complexity, i.e., FLOPs, which however has a clear gap with the direct metric such as throughput. Thus, we propose to use the direct speed evaluation on the target platform as the design principle for efficient ViTs. Particularly, we introduce LITv2, a simple and effective ViT which performs favourably against the existing state-of-the-art methods across a spectrum of different model sizes with faster speed. At the core of LITv2 is a novel self-attention mechanism, which we dub HiLo. HiLo is inspired by the insight that high frequencies in an image capture local fine details and low frequencies focus on global structures, whereas a multi-head self-attention layer neglects the characteristic of different frequencies. Therefore, we propose to disentangle the high/low frequency patterns in an attention layer by separating the heads into two groups, where one group encodes high frequencies via self-attention within each local window, and another group encodes low frequencies by performing global attention between the average-pooled low-frequency keys and values from each window and each query position in the input feature map. Benefiting from the efficient design for both groups, we show that HiLo is superior to the existing attention mechanisms by comprehensively benchmarking FLOPs, speed and memory consumption on GPUs and CPUs. For example, HiLo is 1.4x faster than spatial reduction attention and 1.6x faster than local window attention on CPUs. Powered by HiLo, LITv2 serves as a strong backbone for mainstream vision tasks including image classification, dense detection and segmentation. Code is available at https://github.com/ziplab/LITv2.
    The secret role of undesired physical effects in accurate shape sensing with eccentric FBGs. (arXiv:2210.16316v2 [cs.LG] UPDATED)
    Fiber optic shape sensors have enabled unique advances in various navigation tasks, from medical tool tracking to industrial applications. Eccentric fiber Bragg gratings (FBG) are cheap and easy-to-fabricate shape sensors that are often interrogated with simple setups. However, using low-cost interrogation systems for such intensity-based quasi-distributed sensors introduces further complications to the sensor's signal. Therefore, eccentric FBGs have not been able to accurately estimate complex multi-bend shapes. Here, we present a novel technique to overcome these limitations and provide accurate and precise shape estimation in eccentric FBG sensors. We investigate the most important bending-induced effects in curved optical fibers that are usually eliminated in intensity-based fiber sensors. These effects contain shape deformation information with a higher spatial resolution that we are now able to extract using deep learning techniques. We design a deep learning model based on a convolutional neural network that is trained to predict shapes given the sensor's spectra. We also provide a visual explanation, highlighting wavelength elements whose intensities are more relevant in making shape predictions. These findings imply that deep learning techniques benefit from the bending-induced effects that impact the desired signal in a complex manner. This is the first step toward cheap yet accurate fiber shape sensing solutions.
    Learning programs by combining programs. (arXiv:2206.01614v2 [cs.LG] UPDATED)
    The goal of inductive logic programming is to induce a logic program (a set of logical rules) that generalises training examples. Inducing programs with many rules and literals is a major challenge. To tackle this challenge, we introduce an approach where we learn small non-separable programs and combine them. We implement our approach in a constraint-driven ILP system. Our approach can learn optimal and recursive programs and perform predicate invention. Our experiments on multiple domains, including game playing and program synthesis, show that our approach can drastically outperform existing approaches in terms of predictive accuracies and learning times, sometimes reducing learning times from over an hour to a few seconds.
    Diffusion-based Conditional ECG Generation with Structured State Space Models. (arXiv:2301.08227v1 [eess.SP])
    Synthetic data generation is a promising solution to address privacy issues with the distribution of sensitive health data. Recently, diffusion models have set new standards for generative models for different data modalities. Also very recently, structured state space models emerged as a powerful modeling paradigm to capture long-term dependencies in time series. We put forward SSSD-ECG, as the combination of these two technologies, for the generation of synthetic 12-lead electrocardiograms conditioned on more than 70 ECG statements. Due to a lack of reliable baselines, we also propose conditional variants of two state-of-the-art unconditional generative models. We thoroughly evaluate the quality of the generated samples, by evaluating pretrained classifiers on the generated data and by evaluating the performance of a classifier trained only on synthetic data, where SSSD-ECG clearly outperforms its GAN-based competitors. We demonstrate the soundness of our approach through further experiments, including conditional class interpolation and a clinical Turing test demonstrating the high quality of the SSSD-ECG samples across a wide range of conditions.
    Graph Data Augmentation for Graph Machine Learning: A Survey. (arXiv:2202.08871v2 [cs.LG] UPDATED)
    Data augmentation has recently seen increased interest in graph machine learning given its demonstrated ability to improve model performance and generalization by added training data. Despite this recent surge, the area is still relatively under-explored, due to the challenges brought by complex, non-Euclidean structure of graph data, which limits the direct analogizing of traditional augmentation operations on other types of image, video or text data. Our work aims to give a necessary and timely overview of existing graph data augmentation methods; notably, we present a comprehensive and systematic survey of graph data augmentation approaches, summarizing the literature in a structured manner. We first introduce three different taxonomies for categorizing graph data augmentation methods from the data, task, and learning perspectives, respectively. Next, we introduce recent advances in graph data augmentation, differentiated by their methodologies and applications. We conclude by outlining currently unsolved challenges and directions for future research. Overall, our work aims to clarify the landscape of existing literature in graph data augmentation and motivates additional work in this area, providing a helpful resource for researchers and practitioners in the broader graph machine learning domain. Additionally, we provide a continuously updated reading list at https://github.com/zhao-tong/graph-data-augmentation-papers.
    Job recommendations: benchmarking of collaborative filtering methods for classifieds. (arXiv:2301.07946v1 [cs.IR])
    Classifieds provide many challenges for recommendation methods, due to the limited information regarding users and items. In this paper, we explore recommendation methods for classifieds using the example of OLX Jobs. The goal of the paper is to benchmark different recommendation methods for jobs classifieds in order to improve advertisements' conversion rate and user satisfaction. In our research, we implemented methods that are scalable and represent different approaches to recommendation, namely ALS, LightFM, Prod2Vec, RP3beta, and SLIM. We performed a laboratory comparison of methods with regard to accuracy, diversity, and scalability (memory and time consumption during training and in prediction). Online A/B tests were also carried out by sending millions of messages with recommendations to evaluate models in a real-world setting. In addition, we have published the dataset that we created for the needs of our research. To the best of our knowledge, this is the first dataset of this kind. The dataset contains 65,502,201 events performed on OLX Jobs by 3,295,942 users, who interacted with (displayed, replied to, or bookmarked) 185,395 job ads in two weeks of 2020. We demonstrate that RP3beta, SLIM, and ALS perform significantly better than Prod2Vec and LightFM when tested in a laboratory setting. Online A/B tests also demonstrated that sending messages with recommendations generated by the ALS and RP3beta models increases the number of users contacting advertisers. Additionally, RP3beta had a 20% greater impact on this metric than ALS.
    Everything is Connected: Graph Neural Networks. (arXiv:2301.08210v1 [cs.LG])
    In many ways, graphs are the main modality of data we receive from nature. This is due to the fact that most of the patterns we see, both in natural and artificial systems, are elegantly representable using the language of graph structures. Prominent examples include molecules (represented as graphs of atoms and bonds), social networks and transportation networks. This potential has already been seen by key scientific and industrial groups, with already-impacted application areas including traffic forecasting, drug discovery, social network analysis and recommender systems. Further, some of the most successful domains of application for machine learning in previous years -- images, text and speech processing -- can be seen as special cases of graph representation learning, and consequently there has been significant exchange of information between these areas. The main aim of this short survey is to enable the reader to assimilate the key concepts in the area, and position graph representation learning in a proper context with related fields.
    Automated deep reinforcement learning for real-time scheduling strategy of multi-energy system integrated with post-carbon and direct-air carbon captured system. (arXiv:2301.07768v1 [eess.SY])
    The carbon-capturing process with the aid of CO2 removal technology (CDRT) has been recognised as an alternative and a prominent approach to deep decarbonisation. However, the main hindrance is the enormous energy demand and the economic implication of CDRT if not effectively managed. Hence, a novel deep reinforcement learning agent (DRL), integrated with an automated hyperparameter selection feature, is proposed in this study for the real-time scheduling of a multi-energy system coupled with CDRT. Post-carbon capture systems (PCCS) and direct-air capture systems (DACS) are considered CDRT. Various possible configurations are evaluated using real-time multi-energy data of a district in Arizona and CDRT parameters from manufacturers' catalogues and pilot project documentation. The simulation results validate that an optimised soft-actor critic (SAC) algorithm outperformed the TD3 algorithm due to its maximum entropy feature. We then trained four (4) SAC agents, equivalent to the number of considered case studies, using optimised hyperparameter values and deployed them in real time for evaluation. The results show that the proposed DRL agent can meet the prosumers' multi-energy demand and schedule the CDRT energy demand economically without specified constraints violation. Also, the proposed DRL agent outperformed rule-based scheduling by 23.65%. However, the configuration with PCCS and solid-sorbent DACS is considered the most suitable configuration with a high CO2 captured-released ratio of 38.54, low CO2 released indicator value of 2.53, and a 36.5% reduction in CDR cost due to waste heat utilisation and high absorption capacity of the selected sorbent. However, the adoption of CDRT is not economically viable at the current carbon price. Finally, we showed that CDRT would be attractive at a carbon price of 400-450USD/ton with the provision of tax incentives by the policymakers.
    Kinetic Langevin MCMC Sampling Without Gradient Lipschitz Continuity -- the Strongly Convex Case. (arXiv:2301.08039v1 [math.PR])
    In this article we consider sampling from log concave distributions in Hamiltonian setting, without assuming that the objective gradient is globally Lipschitz. We propose two algorithms based on monotone polygonal (tamed) Euler schemes, to sample from a target measure, and provide non-asymptotic 2-Wasserstein distance bounds between the law of the process of each algorithm and the target measure. Finally, we apply these results to bound the excess risk optimization error of the associated optimization problem.
    A Multi-Resolution Framework for U-Nets with Applications to Hierarchical VAEs. (arXiv:2301.08187v1 [stat.ML])
    U-Net architectures are ubiquitous in state-of-the-art deep learning, however their regularisation properties and relationship to wavelets are understudied. In this paper, we formulate a multi-resolution framework which identifies U-Nets as finite-dimensional truncations of models on an infinite-dimensional function space. We provide theoretical results which prove that average pooling corresponds to projection within the space of square-integrable functions and show that U-Nets with average pooling implicitly learn a Haar wavelet basis representation of the data. We then leverage our framework to identify state-of-the-art hierarchical VAEs (HVAEs), which have a U-Net architecture, as a type of two-step forward Euler discretisation of multi-resolution diffusion processes which flow from a point mass, introducing sampling instabilities. We also demonstrate that HVAEs learn a representation of time which allows for improved parameter efficiency through weight-sharing. We use this observation to achieve state-of-the-art HVAE performance with half the number of parameters of existing models, exploiting the properties of our continuous-time formulation.
    What's happening in your neighborhood? A Weakly Supervised Approach to Detect Local News. (arXiv:2301.08146v1 [cs.IR])
    Local news articles are a subset of news that impact users in a geographical area, such as a city, county, or state. Detecting local news (Step 1) and subsequently deciding its geographical location as well as radius of impact (Step 2) are two important steps towards accurate local news recommendation. Naive rule-based methods, such as detecting city names from the news title, tend to give erroneous results due to lack of understanding of the news content. Empowered by the latest development in natural language processing, we develop an integrated pipeline that enables automatic local news detection and content-based local news recommendations. In this paper, we focus on Step 1 of the pipeline, which highlights: (1) a weakly supervised framework incorporated with domain knowledge and auto data processing, and (2) scalability to multi-lingual settings. Compared with Stanford CoreNLP NER model, our pipeline has higher precision and recall evaluated on a real-world and human-labeled dataset. This pipeline has potential to more precise local news to users, helps local businesses get more exposure, and gives people more information about their neighborhood safety.
    On backpropagating Hessians through ODEs. (arXiv:2301.08085v1 [math.OC])
    We discuss the problem of numerically backpropagating Hessians through ordinary differential equations (ODEs) in various contexts and elucidate how different approaches may be favourable in specific situations. We discuss both theoretical and pragmatic aspects such as, respectively, bounds on computational effort and typical impact of framework overhead. Focusing on the approach of hand-implemented ODE-backpropagation, we develop the computation for the Hessian of orbit-nonclosure for a mechanical system. We also clarify the mathematical framework for extending the backward-ODE-evolution of the costate-equation to Hessians, in its most generic form. Some calculations, such as that of the Hessian for orbit non-closure, are performed in a language, defined in terms of a formal grammar, that we introduce to facilitate the tracking of intermediate quantities. As pedagogical examples, we discuss the Hessian of orbit-nonclosure for the higher dimensional harmonic oscillator and conceptually related problems in Newtonian gravitational theory. In particular, applying our approach to the figure-8 three-body orbit, we readily rediscover a distorted-figure-8 solution originally described by Sim\'o. Possible applications may include: improvements to training of `neural ODE'- type deep learning with second-order methods, numerical analysis of quantum corrections around classical paths, and, more broadly, studying options for adjusting an ODE's initial configuration such that the impact on some given objective function is small.
    Shapley Values with Uncertain Value Functions. (arXiv:2301.08086v1 [cs.LG])
    We propose a novel definition of Shapley values with uncertain value functions based on first principles using probability theory. Such uncertain value functions can arise in the context of explainable machine learning as a result of non-deterministic algorithms. We show that random effects can in fact be absorbed into a Shapley value with a noiseless but shifted value function. Hence, Shapley values with uncertain value functions can be used in analogy to regular Shapley values. However, their reliable evaluation typically requires more computational effort.
    Differentially Private Online Bayesian Estimation With Adaptive Truncation. (arXiv:2301.08202v1 [cs.LG])
    We propose a novel online and adaptive truncation method for differentially private Bayesian online estimation of a static parameter regarding a population. We assume that sensitive information from individuals is collected sequentially and the inferential aim is to estimate, on-the-fly, a static parameter regarding the population to which those individuals belong. We propose sequential Monte Carlo to perform online Bayesian estimation. When individuals provide sensitive information in response to a query, it is necessary to perturb it with privacy-preserving noise to ensure the privacy of those individuals. The amount of perturbation is proportional to the sensitivity of the query, which is determined usually by the range of the queried information. The truncation technique we propose adapts to the previously collected observations to adjust the query range for the next individual. The idea is that, based on previous observations, we can carefully arrange the interval into which the next individual's information is to be truncated before being perturbed with privacy-preserving noise. In this way, we aim to design predictive queries with small sensitivity, hence small privacy-preserving noise, enabling more accurate estimation while maintaining the same level of privacy. To decide on the location and the width of the interval, we use an exploration-exploitation approach a la Thompson sampling with an objective function based on the Fisher information of the generated observation. We show the merits of our methodology with numerical examples.
    Getting Away with More Network Pruning: From Sparsity to Geometry and Linear Regions. (arXiv:2301.07966v1 [cs.LG])
    One surprising trait of neural networks is the extent to which their connections can be pruned with little to no effect on accuracy. But when we cross a critical level of parameter sparsity, pruning any further leads to a sudden drop in accuracy. This drop plausibly reflects a loss in model complexity, which we aim to avoid. In this work, we explore how sparsity also affects the geometry of the linear regions defined by a neural network, and consequently reduces the expected maximum number of linear regions based on the architecture. We observe that pruning affects accuracy similarly to how sparsity affects the number of linear regions and our proposed bound for the maximum number. Conversely, we find out that selecting the sparsity across layers to maximize our bound very often improves accuracy in comparison to pruning as much with the same sparsity in all layers, thereby providing us guidance on where to prune.
    BO-DBA: Query-Efficient Decision-Based Adversarial Attacks via Bayesian Optimization. (arXiv:2106.02732v2 [cs.LG] UPDATED)
    Decision-based attacks (DBA), wherein attackers perturb inputs to spoof learning algorithms by observing solely the output labels, are a type of severe adversarial attacks against Deep Neural Networks (DNNs) requiring minimal knowledge of attackers. State-of-the-art DBA attacks relying on zeroth-order gradient estimation require an excessive number of queries. Recently, Bayesian optimization (BO) has shown promising in reducing the number of queries in score-based attacks (SBA), in which attackers need to observe real-valued probability scores as outputs. However, extending BO to the setting of DBA is nontrivial because in DBA only output labels instead of real-valued scores, as needed by BO, are available to attackers. In this paper, we close this gap by proposing an efficient DBA attack, namely BO-DBA. Different from existing approaches, BO-DBA generates adversarial examples by searching so-called \emph{directions of perturbations}. It then formulates the problem as a BO problem that minimizes the real-valued distortion of perturbations. With the optimized perturbation generation process, BO-DBA converges much faster than the state-of-the-art DBA techniques. Experimental results on pre-trained ImageNet classifiers show that BO-DBA converges within 200 queries while the state-of-the-art DBA techniques need over 15,000 queries to achieve the same level of perturbation distortion. BO-DBA also shows similar attack success rates even as compared to BO-based SBA attacks but with less distortion.
    Hybrid thermal modeling of additive manufacturing processes using physics-informed neural networks for temperature prediction and parameter identification. (arXiv:2206.07756v2 [cs.LG] UPDATED)
    Understanding the thermal behavior of additive manufacturing (AM) processes is crucial for enhancing the quality control and enabling customized process design. Most purely physics-based computational models suffer from intensive computational costs and the need of calibrating unknown parameters, thus not suitable for online control and iterative design application. Data-driven models taking advantage of the latest developed computational tools can serve as a more efficient surrogate, but they are usually trained over a large amount of simulation data and often fail to effectively use small but high-quality experimental data. In this work, we developed a hybrid physics-based data-driven thermal modeling approach of AM processes using physics-informed neural networks. Specifically, partially observed temperature data measured from an infrared camera is combined with the physics laws to predict full-field temperature history and to discover unknown material and process parameters. In the numerical and experimental examples, the effectiveness of adding auxiliary training data and using the pretrained model on training efficiency and prediction accuracy, as well as the ability to identify unknown parameters with partially observed data, are demonstrated. The results show that the hybrid thermal model can effectively identify unknown parameters and capture the full-field temperature accurately, and thus it has the potential to be used in iterative process design and real-time process control of AM.
    Music Playlist Title Generation Using Artist Information. (arXiv:2301.08145v1 [cs.IR])
    Automatically generating or captioning music playlist titles given a set of tracks is of significant interest in music streaming services as customized playlists are widely used in personalized music recommendation, and well-composed text titles attract users and help their music discovery. We present an encoder-decoder model that generates a playlist title from a sequence of music tracks. While previous work takes track IDs as tokenized input for playlist title generation, we use artist IDs corresponding to the tracks to mitigate the issue from the long-tail distribution of tracks included in the playlist dataset. Also, we introduce a chronological data split method to deal with newly-released tracks in real-world scenarios. Comparing the track IDs and artist IDs as input sequences, we show that the artist-based approach significantly enhances the performance in terms of word overlap, semantic relevance, and diversity.
    GIPA++: A General Information Propagation Algorithm for Graph Learning. (arXiv:2301.08209v1 [cs.LG])
    Graph neural networks (GNNs) have been widely used in graph-structured data computation, showing promising performance in various applications such as node classification, link prediction, and network recommendation. Existing works mainly focus on node-wise correlation when doing weighted aggregation of neighboring nodes based on attention, such as dot product by the dense vectors of two nodes. This may cause conflicting noise in nodes to be propagated when doing information propagation. To solve this problem, we propose a General Information Propagation Algorithm (GIPA in short), which exploits more fine-grained information fusion including bit-wise and feature-wise correlations based on edge features in their propagation. Specifically, the bit-wise correlation calculates the element-wise attention weight through a multi-layer perceptron (MLP) based on the dense representations of two nodes and their edge; The feature-wise correlation is based on the one-hot representations of node attribute features for feature selection. We evaluate the performance of GIPA on the Open Graph Benchmark proteins (OGBN-proteins for short) dataset and the Alipay dataset of Alibaba. Experimental results reveal that GIPA outperforms the state-of-the-art models in terms of prediction accuracy, e.g., GIPA achieves an average ROC-AUC of $0.8901\pm 0.0011$, which is better than that of all the existing methods listed in the OGBN-proteins leaderboard.
    Building Concise Logical Patterns by Constraining Tsetlin Machine Clause Size. (arXiv:2301.08190v1 [cs.LG])
    Tsetlin machine (TM) is a logic-based machine learning approach with the crucial advantages of being transparent and hardware-friendly. While TMs match or surpass deep learning accuracy for an increasing number of applications, large clause pools tend to produce clauses with many literals (long clauses). As such, they become less interpretable. Further, longer clauses increase the switching activity of the clause logic in hardware, consuming more power. This paper introduces a novel variant of TM learning - Clause Size Constrained TMs (CSC-TMs) - where one can set a soft constraint on the clause size. As soon as a clause includes more literals than the constraint allows, it starts expelling literals. Accordingly, oversized clauses only appear transiently. To evaluate CSC-TM, we conduct classification, clustering, and regression experiments on tabular data, natural language text, images, and board games. Our results show that CSC-TM maintains accuracy with up to 80 times fewer literals. Indeed, the accuracy increases with shorter clauses for TREC, IMDb, and BBC Sports. After the accuracy peaks, it drops gracefully as the clause size approaches a single literal. We finally analyze CSC-TM power consumption and derive new convergence properties.
    Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation. (arXiv:2206.11489v2 [cs.LG] UPDATED)
    We study reinforcement learning with linear function approximation where the transition probability and reward functions are linear with respect to a feature mapping $\boldsymbol{\phi}(s,a)$. Specifically, we consider the episodic inhomogeneous linear Markov Decision Process (MDP), and propose a novel computation-efficient algorithm, LSVI-UCB$^+$, which achieves an $\widetilde{O}(Hd\sqrt{T})$ regret bound where $H$ is the episode length, $d$ is the feature dimension, and $T$ is the number of steps. LSVI-UCB$^+$ builds on weighted ridge regression and upper confidence value iteration with a Bernstein-type exploration bonus. Our statistical results are obtained with novel analytical tools, including a new Bernstein self-normalized bound with conservatism on elliptical potentials, and refined analysis of the correction term. To the best of our knowledge, this is the first minimax optimal algorithm for linear MDPs up to logarithmic factors, which closes the $\sqrt{Hd}$ gap between the best known upper bound of $\widetilde{O}(\sqrt{H^3d^3T})$ in \cite{jin2020provably} and lower bound of $\Omega(Hd\sqrt{T})$ for linear MDPs.
    A Survey of Meta-Reinforcement Learning. (arXiv:2301.08028v1 [cs.LG])
    While deep reinforcement learning (RL) has fueled multiple high-profile successes in machine learning, it is held back from more widespread adoption by its often poor data efficiency and the limited generality of the policies it produces. A promising approach for alleviating these limitations is to cast the development of better RL algorithms as a machine learning problem itself in a process called meta-RL. Meta-RL is most commonly studied in a problem setting where, given a distribution of tasks, the goal is to learn a policy that is capable of adapting to any new task from the task distribution with as little data as possible. In this survey, we describe the meta-RL problem setting in detail as well as its major variations. We discuss how, at a high level, meta-RL research can be clustered based on the presence of a task distribution and the learning budget available for each individual task. Using these clusters, we then survey meta-RL algorithms and applications. We conclude by presenting the open problems on the path to making meta-RL part of the standard toolbox for a deep RL practitioner.
    Geometric path augmentation for inference of sparsely observed stochastic nonlinear systems. (arXiv:2301.08102v1 [physics.data-an])
    Stochastic evolution equations describing the dynamics of systems under the influence of both deterministic and stochastic forces are prevalent in all fields of science. Yet, identifying these systems from sparse-in-time observations remains still a challenging endeavour. Existing approaches focus either on the temporal structure of the observations by relying on conditional expectations, discarding thereby information ingrained in the geometry of the system's invariant density; or employ geometric approximations of the invariant density, which are nevertheless restricted to systems with conservative forces. Here we propose a method that reconciles these two paradigms. We introduce a new data-driven path augmentation scheme that takes the local observation geometry into account. By employing non-parametric inference on the augmented paths, we can efficiently identify the deterministic driving forces of the underlying system for systems observed at low sampling rates.
    Learning to Rank by Causal Effects Without Data to Accurately Estimate Causal Effects. (arXiv:2206.12532v2 [stat.ML] UPDATED)
    Decision makers often want to identify the individuals for whom some intervention or treatment will be most effective in order to decide who to treat. In such cases, decision makers would ideally like to rank potential recipients of the treatment according to their individual causal effects. However, the available data may be completely inadequate to estimate causal effects accurately. We formalize a new assumption -- the rank preservation assumption (RPA) -- that defines when data are suitable to learn how to rank individuals according to their causal effects, even if the effects themselves cannot be accurately estimated. The RPA holds when there is data to estimate a scoring variable that induces the same ranking of individuals as the causal effect of interest. Some of the scoring variables we consider are confounded estimates, proxy causal effects, and non-causal quantities. We show that such scoring variables can work well for treatment assignment if the RPA is met, and potentially even better than using causal effects as scores. We also show that the RPA holds under conditions that are more general and weaker than the typical assumptions made in observational studies. Finally, we showcase how practitioners can apply and evaluate alternative scoring models (including non-causal models) to maximize the causal impact of their targeting decisions.
    Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. (arXiv:2301.08243v1 [cs.CV])
    This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) predict several target blocks in the image, (b) sample target blocks with sufficiently large scale (occupying 15%-20% of the image), and (c) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/16 on ImageNet using 32 A100 GPUs in under 38 hours to achieve strong downstream performance across a wide range of tasks requiring various levels of abstraction, from linear classification to object counting and depth prediction.
    TINKER: A framework for Open source Cyberthreat Intelligence. (arXiv:2102.05571v6 [cs.CR] UPDATED)
    Threat intelligence on malware attacks and campaigns is increasingly being shared with other security experts for a cost or for free. Other security analysts use this intelligence to inform them of indicators of compromise, attack techniques, and preventative actions. Security analysts prepare threat analysis reports after investigating an attack, an emerging cyber threat, or a recently discovered vulnerability. Collectively known as cyber threat intelligence (CTI), the reports are typically in an unstructured format and, therefore, challenging to integrate seamlessly into existing intrusion detection systems. This paper proposes a framework that uses the aggregated CTI for analysis and defense at scale. The information is extracted and stored in a structured format using knowledge graphs such that the semantics of the threat intelligence can be preserved and shared at scale with other security analysts. Specifically, we propose the first semi-supervised open-source knowledge graph-based framework, TINKER, to capture cyber threat information and its context. Following TINKER, we generate a Cyberthreat Intelligence Knowledge Graph (CTI-KG) and demonstrate the usage using different use cases.
    CEnt: An Entropy-based Model-agnostic Explainability Framework to Contrast Classifiers' Decisions. (arXiv:2301.07941v1 [cs.LG])
    Current interpretability methods focus on explaining a particular model's decision through present input features. Such methods do not inform the user of the sufficient conditions that alter these decisions when they are not desirable. Contrastive explanations circumvent this problem by providing explanations of the form "If the feature $X>x$, the output $Y$ would be different''. While different approaches are developed to find contrasts; these methods do not all deal with mutability and attainability constraints. In this work, we present a novel approach to locally contrast the prediction of any classifier. Our Contrastive Entropy-based explanation method, CEnt, approximates a model locally by a decision tree to compute entropy information of different feature splits. A graph, G, is then built where contrast nodes are found through a one-to-many shortest path search. Contrastive examples are generated from the shortest path to reflect feature splits that alter model decisions while maintaining lower entropy. We perform local sampling on manifold-like distances computed by variational auto-encoders to reflect data density. CEnt is the first non-gradient-based contrastive method generating diverse counterfactuals that do not necessarily exist in the training data while satisfying immutability (ex. race) and semi-immutability (ex. age can only change in an increasing direction). Empirical evaluation on four real-world numerical datasets demonstrates the ability of CEnt in generating counterfactuals that achieve better proximity rates than existing methods without compromising latency, feasibility, and attainability. We further extend CEnt to imagery data to derive visually appealing and useful contrasts between class labels on MNIST and Fashion MNIST datasets. Finally, we show how CEnt can serve as a tool to detect vulnerabilities of textual classifiers.
    Federated Automatic Differentiation. (arXiv:2301.07806v1 [cs.LG])
    Federated learning (FL) is a general framework for learning across heterogeneous clients while preserving data privacy, under the orchestration of a central server. FL methods often compute gradients of loss functions purely locally (ie. entirely at each client, or entirely at the server), typically using automatic differentiation (AD) techniques. We propose a federated automatic differentiation (FAD) framework that 1) enables computing derivatives of functions involving client and server computation as well as communication between them and 2) operates in a manner compatible with existing federated technology. In other words, FAD computes derivatives across communication boundaries. We show, in analogy with traditional AD, that FAD may be implemented using various accumulation modes, which introduce distinct computation-communication trade-offs and systems requirements. Further, we show that a broad class of federated computations is closed under these various modes of FAD, implying in particular that if the original computation can be implemented using privacy-preserving primitives, its derivative may be computed using only these same primitives. We then show how FAD can be used to create algorithms that dynamically learn components of the algorithm itself. In particular, we show that FedAvg-style algorithms can exhibit significantly improved performance by using FAD to adjust the server optimization step automatically, or by using FAD to learn weighting schemes for computing weighted averages across clients.
    Global Nash Equilibrium in Non-convex Multi-player Game: Theory and Algorithms. (arXiv:2301.08015v1 [cs.GT])
    Wide machine learning tasks can be formulated as non-convex multi-player games, where Nash equilibrium (NE) is an acceptable solution to all players, since no one can benefit from changing its strategy unilaterally. Attributed to the non-convexity, obtaining the existence condition of global NE is challenging, let alone designing theoretically guaranteed realization algorithms. This paper takes conjugate transformation to the formulation of non-convex multi-player games, and casts the complementary problem into a variational inequality (VI) problem with a continuous pseudo-gradient mapping. We then prove the existence condition of global NE: the solution to the VI problem satisfies a duality relation. Based on this VI formulation, we design a conjugate-based ordinary differential equation (ODE) to approach global NE, which is proved to have an exponential convergence rate. To make the dynamics more implementable, we further derive a discretized algorithm. We apply our algorithm to two typical scenarios: multi-player generalized monotone game and multi-player potential game. In the two settings, we prove that the step-size setting is required to be $\mathcal{O}(1/k)$ and $\mathcal{O}(1/\sqrt k)$ to yield the convergence rates of $\mathcal{O}(1/ k)$ and $\mathcal{O}(1/\sqrt k)$, respectively. Extensive experiments in robust neural network training and sensor localization are in full agreement with our theory.
    Sample-Efficient Multi-Objective Learning via Generalized Policy Improvement Prioritization. (arXiv:2301.07784v1 [cs.LG])
    Multi-objective reinforcement learning (MORL) algorithms tackle sequential decision problems where agents may have different preferences over (possibly conflicting) reward functions. Such algorithms often learn a set of policies (each optimized for a particular agent preference) that can later be used to solve problems with novel preferences. We introduce a novel algorithm that uses Generalized Policy Improvement (GPI) to define principled, formally-derived prioritization schemes that improve sample-efficient learning. They implement active-learning strategies by which the agent can (i) identify the most promising preferences/objectives to train on at each moment, to more rapidly solve a given MORL problem; and (ii) identify which previous experiences are most relevant when learning a policy for a particular agent preference, via a novel Dyna-style MORL method. We prove our algorithm is guaranteed to always converge to an optimal solution in a finite number of steps, or an $\epsilon$-optimal solution (for a bounded $\epsilon$) if the agent is limited and can only identify possibly sub-optimal policies. We also prove that our method monotonically improves the quality of its partial solutions while learning. Finally, we introduce a bound that characterizes the maximum utility loss (with respect to the optimal solution) incurred by the partial solutions computed by our method throughout learning. We empirically show that our method outperforms state-of-the-art MORL algorithms in challenging multi-objective tasks, both with discrete and continuous state spaces.
    Neural Regression For Scale-Varying Targets. (arXiv:2211.07447v4 [cs.LG] UPDATED)
    In this work, we demonstrate that a major limitation of regression using a mean-squared error loss is its sensitivity to the scale of its targets. This makes learning settings consisting of target's whose values take on varying scales challenging. A recently-proposed alternative loss function, known as histogram loss, avoids this issue. However, its computational cost grows linearly with the number of buckets in the histogram, which renders prediction with real-valued targets intractable. To address this issue, we propose a novel approach to training deep learning models on real-valued regression targets, autoregressive regression, which learns a high-fidelity distribution by utilizing an autoregressive target decomposition. We demonstrate that this training objective allows us to solve regression tasks involving targets with different scales.
    Understanding the diffusion models by conditional expectations. (arXiv:2301.07882v1 [cs.LG])
    This paper provide several mathematical analyses of the diffusion model in machine learning. The drift term of the backwards sampling process is represented as a conditional expectation involving the data distribution and the forward diffusion. The training process aims to find such a drift function by minimizing the mean-squared residue related to the conditional expectation. Using small-time approximations of the Green's function of the forward diffusion, we show that the analytical mean drift function in DDPM and the score function in SGM asymptotically blow up in the final stages of the sampling process for singular data distributions such as those concentrated on lower-dimensional manifolds, and is therefore difficult to approximate by a network. To overcome this difficulty, we derive a new target function and associated loss, which remains bounded even for singular data distributions. We illustrate the theoretical findings with several numerical examples.
    Interval Reachability of Nonlinear Dynamical Systems with Neural Network Controllers. (arXiv:2301.07912v1 [eess.SY])
    This paper proposes a computationally efficient framework, based on interval analysis, for rigorous verification of nonlinear continuous-time dynamical systems with neural network controllers. Given a neural network, we use an existing verification algorithm to construct inclusion functions for its input-output behavior. Inspired by mixed monotone theory, we embed the closed-loop dynamics into a larger system using an inclusion function of the neural network and a decomposition function of the open-loop system. This embedding provides a scalable approach for safety analysis of the neural control loop while preserving the nonlinear structure of the system. We show that one can efficiently compute hyper-rectangular over-approximations of the reachable sets using a single trajectory of the embedding system. We design an algorithm to leverage this computational advantage through partitioning strategies, improving our reachable set estimates while balancing its runtime with tunable parameters. We demonstrate the performance of this algorithm through two case studies. First, we demonstrate this method's strength in complex nonlinear environments. Then, we show that our approach matches the performance of the state-of-the art verification algorithm for linear discretized systems.
    Learning-Rate-Free Learning by D-Adaptation. (arXiv:2301.07733v1 [cs.LG])
    The speed of gradient descent for convex Lipschitz functions is highly dependent on the choice of learning rate. Setting the learning rate to achieve the optimal convergence rate requires knowing the distance D from the initial point to the solution set. In this work, we describe a single-loop method, with no back-tracking or line searches, which does not require knowledge of $D$ yet asymptotically achieves the optimal rate of convergence for the complexity class of convex Lipschitz functions. Our approach is the first parameter-free method for this class without additional multiplicative log factors in the convergence rate. We present extensive experiments for SGD and Adam variants of our method, where the method automatically matches hand-tuned learning rates across more than a dozen diverse machine learning problems, including large-scale vision and language problems. Our method is practical, efficient and requires no additional function value or gradient evaluations each step. An open-source implementation is available (https://github.com/facebookresearch/dadaptation).
    Continuously Reliable Detection of New-Normal Misinformation: Semantic Masking and Contrastive Smoothing in High-Density Latent Regions. (arXiv:2301.07981v1 [cs.LG])
    Toxic misinformation campaigns have caused significant societal harm, e.g., affecting elections and COVID-19 information awareness. Unfortunately, despite successes of (gold standard) retrospective studies of misinformation that confirmed their harmful effects after the fact, they arrive too late for timely intervention and reduction of such harm. By design, misinformation evades retrospective classifiers by exploiting two properties we call new-normal: (1) never-seen-before novelty that cause inescapable generalization challenges for previous classifiers, and (2) massive but short campaigns that end before they can be manually annotated for new classifier training. To tackle these challenges, we propose UFIT, which combines two techniques: semantic masking of strong signal keywords to reduce overfitting, and intra-proxy smoothness regularization of high-density regions in the latent space to improve reliability and maintain accuracy. Evaluation of UFIT on public new-normal misinformation data shows over 30% improvement over existing approaches on future (and unseen) campaigns. To the best of our knowledge, UFIT is the first successful effort to achieve such high level of generalization on new-normal misinformation data with minimal concession (1 to 5%) of accuracy compared to oracles trained with full knowledge of all campaigns.
    Tight Guarantees for Interactive Decision Making with the Decision-Estimation Coefficient. (arXiv:2301.08215v1 [cs.LG])
    A foundational problem in reinforcement learning and interactive decision making is to understand what modeling assumptions lead to sample-efficient learning guarantees, and what algorithm design principles achieve optimal sample complexity. Recently, Foster et al. (2021) introduced the Decision-Estimation Coefficient (DEC), a measure of statistical complexity which leads to upper and lower bounds on the optimal sample complexity for a general class of problems encompassing bandits and reinforcement learning with function approximation. In this paper, we introduce a new variant of the DEC, the Constrained Decision-Estimation Coefficient, and use it to derive new lower bounds that improve upon prior work on three fronts: - They hold in expectation, with no restrictions on the class of algorithms under consideration. - They hold globally, and do not rely on the notion of localization used by Foster et al. (2021). - Most interestingly, they allow the reference model with respect to which the DEC is defined to be improper, establishing that improper reference models play a fundamental role. We provide upper bounds on regret that scale with the same quantity, thereby closing all but one of the gaps between upper and lower bounds in Foster et al. (2021). Our results apply to both the regret framework and PAC framework, and make use of several new analysis and algorithm design techniques that we anticipate will find broader use.
    Multi-Agent Interplay in a Competitive Survival Environment. (arXiv:2301.08030v1 [cs.LG])
    Solving hard-exploration environments in an important challenge in Reinforcement Learning. Several approaches have been proposed and studied, such as Intrinsic Motivation, co-evolution of agents and tasks, and multi-agent competition. In particular, the interplay between multiple agents has proven to be capable of generating human-relevant emergent behaviour that would be difficult or impossible to learn in single-agent settings. In this work, an extensible competitive environment for multi-agent interplay was developed, which features realistic physics and human-relevant semantics. Moreover, several experiments on different variants of this environment were performed, resulting in some simple emergent strategies and concrete directions for future improvement. The content presented here is part of the author's thesis "Multi-Agent Interplay in a Competitive Survival Environment" for the Master's Degree in Artificial Intelligence and Robotics at Sapienza University of Rome, 2022.
    A Survey of Zero-shot Generalisation in Deep Reinforcement Learning. (arXiv:2111.09794v6 [cs.LG] UPDATED)
    The study of zero-shot generalisation (ZSG) in deep Reinforcement Learning (RL) aims to produce RL algorithms whose policies generalise well to novel unseen situations at deployment time, avoiding overfitting to their training environments. Tackling this is vital if we are to deploy reinforcement learning algorithms in real world scenarios, where the environment will be diverse, dynamic and unpredictable. This survey is an overview of this nascent field. We rely on a unifying formalism and terminology for discussing different ZSG problems, building upon previous works. We go on to categorise existing benchmarks for ZSG, as well as current methods for tackling these problems. Finally, we provide a critical discussion of the current state of the field, including recommendations for future work. Among other conclusions, we argue that taking a purely procedural content generation approach to benchmark design is not conducive to progress in ZSG, we suggest fast online adaptation and tackling RL-specific problems as some areas for future work on methods for ZSG, and we recommend building benchmarks in underexplored problem settings such as offline RL ZSG and reward-function variation.
    Explainability in subgraphs-enhanced Graph Neural Networks. (arXiv:2209.07926v2 [cs.LG] UPDATED)
    Recently, subgraphs-enhanced Graph Neural Networks (SGNNs) have been introduced to enhance the expressive power of Graph Neural Networks (GNNs), which was proved to be not higher than the 1-dimensional Weisfeiler-Leman isomorphism test. The new paradigm suggests using subgraphs extracted from the input graph to improve the model's expressiveness, but the additional complexity exacerbates an already challenging problem in GNNs: explaining their predictions. In this work, we adapt PGExplainer, one of the most recent explainers for GNNs, to SGNNs. The proposed explainer accounts for the contribution of all the different subgraphs and can produce a meaningful explanation that humans can interpret. The experiments that we performed both on real and synthetic datasets show that our framework is successful in explaining the decision process of an SGNN on graph classification tasks.
    Concept Discovery for Fast Adapatation. (arXiv:2301.07850v1 [cs.LG])
    The advances in deep learning have enabled machine learning methods to outperform human beings in various areas, but it remains a great challenge for a well-trained model to quickly adapt to a new task. One promising solution to realize this goal is through meta-learning, also known as learning to learn, which has achieved promising results in few-shot learning. However, current approaches are still enormously different from human beings' learning process, especially in the ability to extract structural and transferable knowledge. This drawback makes current meta-learning frameworks non-interpretable and hard to extend to more complex tasks. We tackle this problem by introducing concept discovery to the few-shot learning problem, where we achieve more effective adaptation by meta-learning the structure among the data features, leading to a composite representation of the data. Our proposed method Concept-Based Model-Agnostic Meta-Learning (COMAML) has been shown to achieve consistent improvements in the structured data for both synthesized datasets and real-world datasets.
    Using CycleGANs to Generate Realistic STEM Images for Machine Learning. (arXiv:2301.07743v1 [cond-mat.mtrl-sci])
    The rise of automation and machine learning (ML) in electron microscopy has the potential to revolutionize materials research by enabling the autonomous collection and processing of vast amounts of atomic resolution data. However, a major challenge is developing ML models that can reliably and rapidly generalize to large data sets with varying experimental conditions. To overcome this challenge, we develop a cycle generative adversarial network (CycleGAN) that introduces a novel reciprocal space discriminator to augment simulated data with realistic, complex spatial frequency information learned from experimental data. This enables the CycleGAN to generate nearly indistinguishable images from real experimental data, while also providing labels for further ML applications. We demonstrate the effectiveness of this approach by training a fully convolutional network (FCN) to identify single atom defects in a large data set of 4.5 million atoms, which we collected using automated acquisition in an aberration-corrected scanning transmission electron microscope (STEM). Our approach yields highly adaptable FCNs that can adjust to dynamically changing experimental variables, such as lens aberrations, noise, and local contamination, with minimal manual intervention. This represents a significant step towards building fully autonomous approaches for harnessing microscopy big data.
    Position Regression for Unsupervised Anomaly Detection. (arXiv:2301.08064v1 [cs.CV])
    In recent years, anomaly detection has become an essential field in medical image analysis. Most current anomaly detection methods for medical images are based on image reconstruction. In this work, we propose a novel anomaly detection approach based on coordinate regression. Our method estimates the position of patches within a volume, and is trained only on data of healthy subjects. During inference, we can detect and localize anomalies by considering the error of the position estimate of a given patch. We apply our method to 3D CT volumes and evaluate it on patients with intracranial haemorrhages and cranial fractures. The results show that our method performs well in detecting these anomalies. Furthermore, we show that our method requires less memory than comparable approaches that involve image reconstruction. This is highly relevant for processing large 3D volumes, for instance, CT or MRI scans.
    A Domain-Agnostic Approach for Characterization of Lifelong Learning Systems. (arXiv:2301.07799v1 [cs.LG])
    Despite the advancement of machine learning techniques in recent years, state-of-the-art systems lack robustness to "real world" events, where the input distributions and tasks encountered by the deployed systems will not be limited to the original training context, and systems will instead need to adapt to novel distributions and tasks while deployed. This critical gap may be addressed through the development of "Lifelong Learning" systems that are capable of 1) Continuous Learning, 2) Transfer and Adaptation, and 3) Scalability. Unfortunately, efforts to improve these capabilities are typically treated as distinct areas of research that are assessed independently, without regard to the impact of each separate capability on other aspects of the system. We instead propose a holistic approach, using a suite of metrics and an evaluation framework to assess Lifelong Learning in a principled way that is agnostic to specific domains or system techniques. Through five case studies, we show that this suite of metrics can inform the development of varied and complex Lifelong Learning systems. We highlight how the proposed suite of metrics quantifies performance trade-offs present during Lifelong Learning system development - both the widely discussed Stability-Plasticity dilemma and the newly proposed relationship between Sample Efficient and Robust Learning. Further, we make recommendations for the formulation and use of metrics to guide the continuing development of Lifelong Learning systems and assess their progress in the future.
    A Nonstochastic Control Approach to Optimization. (arXiv:2301.07902v1 [cs.LG])
    Tuning optimizer hyperparameters, notably the learning rate to a particular optimization instance, is an important but nonconvex problem. Therefore iterative optimization methods such as hypergradient descent lack global optimality guarantees in general. We propose an online nonstochastic control methodology for mathematical optimization. The choice of hyperparameters for gradient based methods, including the learning rate, momentum parameter and preconditioner, is described as feedback control. The optimal solution to this control problem is shown to encompass preconditioned adaptive gradient methods with varying acceleration and momentum parameters. Although the optimal control problem by itself is nonconvex, we show how recent methods from online nonstochastic control based on convex relaxation can be applied to compete with the best offline solution. This guarantees that in episodic optimization, we converge to the best optimization method in hindsight.
    Identification, explanation and clinical evaluation of hospital patient subtypes. (arXiv:2301.08019v1 [cs.LG])
    We present a pipeline in which unsupervised machine learning techniques are used to automatically identify subtypes of hospital patients admitted between 2017 and 2021 in a large UK teaching hospital. With the use of state-of-the-art explainability techniques, the identified subtypes are interpreted and assigned clinical meaning. In parallel, clinicians assessed intra-cluster similarities and inter-cluster differences of the identified patient subtypes within the context of their clinical knowledge. By confronting the outputs of both automatic and clinician-based explanations, we aim to highlight the mutual benefit of combining machine learning techniques with clinical expertise.
    Discover governing differential equations from evolving systems. (arXiv:2301.07863v1 [physics.comp-ph])
    Discovering the governing equations of evolving systems from available observations is essential and challenging. However, current methods does not capture the situation that underlying system dynamics can be changed.Evolving systems are changing over time, which invariably changes with system status. Thus, finding the exact change points is critical. We propose an online modeling method capable of handling samples one by one sequentially by modeling streaming data instead of processing the entire dataset. The proposed method performs well in discovering ordinary differential equations, partial differential equations (PDEs), and high-dimensional PDEs from streaming data. The measurement generated from a changed system is distributed dissimilarly to before; hence, the difference can be identified by the proposed method. Our proposal performs well in identifying the change points and discovering governing differential equations in two evolving systems.
    Catapult Dynamics and Phase Transitions in Quadratic Nets. (arXiv:2301.07737v1 [cs.LG])
    Neural networks trained with gradient descent can undergo non-trivial phase transitions as a function of the learning rate. In (Lewkowycz et al., 2020) it was discovered that wide neural nets can exhibit a catapult phase for super-critical learning rates, where the training loss grows exponentially quickly at early times before rapidly decreasing to a small value. During this phase the top eigenvalue of the neural tangent kernel (NTK) also undergoes significant evolution. In this work, we will prove that the catapult phase exists in a large class of models, including quadratic models and two-layer, homogenous neural nets. To do this, we show that for a certain range of learning rates the weight norm decreases whenever the loss becomes large. We also empirically study learning rates beyond this theoretically derived range and show that the activation map of ReLU nets trained with super-critical learning rates becomes increasingly sparse as we increase the learning rate.
    ClusterLog: Clustering Logs for Effective Log-based Anomaly Detection. (arXiv:2301.07846v1 [cs.DC])
    With the increasing prevalence of scalable file systems in the context of High Performance Computing (HPC), the importance of accurate anomaly detection on runtime logs is increasing. But as it currently stands, many state-of-the-art methods for log-based anomaly detection, such as DeepLog, have encountered numerous challenges when applied to logs from many parallel file systems (PFSes), often due to their irregularity and ambiguity in time-based log sequences. To circumvent these problems, this study proposes ClusterLog, a log pre-processing method that clusters the temporal sequence of log keys based on their semantic similarity. By grouping semantically and sentimentally similar logs, this approach aims to represent log sequences with the smallest amount of unique log keys, intending to improve the ability of a downstream sequence-based model to effectively learn the log patterns. The preliminary results of ClusterLog indicate not only its effectiveness in reducing the granularity of log sequences without the loss of important sequence information but also its generalizability to different file systems' logs.
    From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition. (arXiv:2301.07851v1 [cs.SD])
    In this work, we propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition, which can \textbf{re-purpose} well-trained English automatic speech recognition (ASR) models to recognize the other languages. We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement that, for the first time, empowers model reprogramming on ASR. Specifically, we investigate how to select trainable components (i.e., encoder) of a conformer-based RNN-Transducer, as a frozen pre-trained backbone. Experiments on a seven-language multilingual LibriSpeech speech (MLS) task show that model reprogramming only requires 4.2% (11M out of 270M) to 6.8% (45M out of 660M) of its original trainable parameters from a full ASR model to perform competitive results in a range of 11.9% to 8.1% WER averaged across different languages. In addition, we discover different setups to make large-scale pre-trained ASR succeed in both monolingual and multilingual speech recognition. Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses (e.g., w2v-bert) in terms of lower WER and better training efficiency.
    Suboptimality analysis of receding horizon quadratic control with unknown linear systems and its applications in learning-based control. (arXiv:2301.07876v1 [eess.SY])
    For a receding-horizon controller with a known system and with an approximate terminal value function, it is well-known that increasing the prediction horizon can improve its control performance. However, when the prediction model is inexact, a larger prediction horizon also causes propagation and accumulation of the prediction error. In this work, we aim to analyze the effect of the above trade-off between the modeling error, the terminal value function error, and the prediction horizon on the performance of a nominal receding-horizon linear quadratic (LQ) controller. By developing a novel perturbation result of the Riccati difference equation, a performance upper bound is obtained and suggests that for many cases, the prediction horizon should be either 1 or infinity to improve the control performance, depending on the relative difference between the modeling error and the terminal value function error. The obtained suboptimality performance bound is also applied to provide end-to-end performance guarantees, e.g., regret bounds, for nominal receding-horizon LQ controllers in a learning-based setting.
    HCE: Improving Performance and Efficiency with Heterogeneously Compressed Neural Network Ensemble. (arXiv:2301.07794v1 [cs.LG])
    Ensemble learning has gain attention in resent deep learning research as a way to further boost the accuracy and generalizability of deep neural network (DNN) models. Recent ensemble training method explores different training algorithms or settings on multiple sub-models with the same model architecture, which lead to significant burden on memory and computation cost of the ensemble model. Meanwhile, the heurtsically induced diversity may not lead to significant performance gain. We propose a new prespective on exploring the intrinsic diversity within a model architecture to build efficient DNN ensemble. We make an intriguing observation that pruning and quantization, while both leading to efficient model architecture at the cost of small accuracy drop, leads to distinct behavior in the decision boundary. To this end, we propose Heterogeneously Compressed Ensemble (HCE), where we build an efficient ensemble with the pruned and quantized variants from a pretrained DNN model. An diversity-aware training objective is proposed to further boost the performance of the HCE ensemble. Experiemnt result shows that HCE achieves significant improvement in the efficiency-accuracy tradeoff comparing to both traditional DNN ensemble training methods and previous model compression methods.
    FE-TCM: Filter-Enhanced Transformer Click Model for Web Search. (arXiv:2301.07854v1 [cs.IR])
    Constructing click models and extracting implicit relevance feedback information from the interaction between users and search engines are very important to improve the ranking of search results. Using neural network to model users' click behaviors has become one of the effective methods to construct click models. In this paper, We use Transformer as the backbone network of feature extraction, add filter layer innovatively, and propose a new Filter-Enhanced Transformer Click Model (FE-TCM) for web search. Firstly, in order to reduce the influence of noise on user behavior data, we use the learnable filters to filter log noise. Secondly, following the examination hypothesis, we model the attraction estimator and examination predictor respectively to output the attractiveness scores and examination probabilities. A novel transformer model is used to learn the deeper representation among different features. Finally, we apply the combination functions to integrate attractiveness scores and examination probabilities into the click prediction. From our experiments on two real-world session datasets, it is proved that FE-TCM outperforms the existing click models for the click prediction.
    A Scalable Finite Difference Method for Deep Reinforcement Learning. (arXiv:2210.07487v2 [cs.LG] UPDATED)
    Several low-bandwidth distributable black-box optimization algorithms in the family of finite differences such as Evolution Strategies have recently been shown to perform nearly as well as tailored Reinforcement Learning methods in some Reinforcement Learning domains. One shortcoming of these black-box methods is that they must collect information about the structure of the return function at every update, and can often employ only information drawn from a distribution centered around the current parameters. As a result, when these algorithms are distributed across many machines, a significant portion of total runtime may be spent with many machines idle, waiting for a final return and then for an update to be calculated. In this work we introduce a novel method to use older data in finite difference algorithms, which produces a scalable algorithm that avoids significant idle time or wasted computation.
    RNAS-CL: Robust Neural Architecture Search by Cross-Layer Knowledge Distillation. (arXiv:2301.08092v1 [cs.CV])
    Deep Neural Networks are vulnerable to adversarial attacks. Neural Architecture Search (NAS), one of the driving tools of deep neural networks, demonstrates superior performance in prediction accuracy in various machine learning applications. However, it is unclear how it performs against adversarial attacks. Given the presence of a robust teacher, it would be interesting to investigate if NAS would produce robust neural architecture by inheriting robustness from the teacher. In this paper, we propose Robust Neural Architecture Search by Cross-Layer Knowledge Distillation (RNAS-CL), a novel NAS algorithm that improves the robustness of NAS by learning from a robust teacher through cross-layer knowledge distillation. Unlike previous knowledge distillation methods that encourage close student/teacher output only in the last layer, RNAS-CL automatically searches for the best teacher layer to supervise each student layer. Experimental result evidences the effectiveness of RNAS-CL and shows that RNAS-CL produces small and robust neural architecture.
    SpotHitPy: A Study For ML-Based Song Hit Prediction Using Spotify. (arXiv:2301.07978v1 [cs.SD])
    In this study, we approached the Hit Song Prediction problem, which aims to predict which songs will become Billboard hits. We gathered a dataset of nearly 18500 hit and non-hit songs and extracted their audio features using the Spotify Web API. We test four machine-learning models on our dataset. We were able to predict the Billboard success of a song with approximately 86\% accuracy. The most succesful algorithms were Random Forest and Support Vector Machine.
    General Greedy De-bias Learning. (arXiv:2112.10572v5 [cs.LG] UPDATED)
    Neural networks often make predictions relying on the spurious correlations from the datasets rather than the intrinsic properties of the task of interest, facing sharp degradation on out-of-distribution (OOD) test data. Existing de-bias learning frameworks try to capture specific dataset bias by annotations but they fail to handle complicated OOD scenarios. Others implicitly identify the dataset bias by special design low capability biased models or losses, but they degrade when the training and testing data are from the same distribution. In this paper, we propose a General Greedy De-bias learning framework (GGD), which greedily trains the biased models and the base model. The base model is encouraged to focus on examples that are hard to solve with biased models, thus remaining robust against spurious correlations in the test stage. GGD largely improves models' OOD generalization ability on various tasks, but sometimes over-estimates the bias level and degrades on the in-distribution test. We further re-analyze the ensemble process of GGD and introduce the Curriculum Regularization inspired by curriculum learning, which achieves a good trade-off between in-distribution and out-of-distribution performance. Extensive experiments on image classification, adversarial question answering, and visual question answering demonstrate the effectiveness of our method. GGD can learn a more robust base model under the settings of both task-specific biased models with prior knowledge and self-ensemble biased model without prior knowledge.
    Learning Quantum Processes with Memory -- Quantum Recurrent Neural Networks. (arXiv:2301.08167v1 [quant-ph])
    Recurrent neural networks play an important role in both research and industry. With the advent of quantum machine learning, the quantisation of recurrent neural networks has become recently relevant. We propose fully quantum recurrent neural networks, based on dissipative quantum neural networks, capable of learning general causal quantum automata. A quantum training algorithm is proposed and classical simulations for the case of product outputs with the fidelity as cost function are carried out. We thereby demonstrate the potential of these algorithms to learn complex quantum processes with memory in terms of the exemplary delay channel, the time evolution of quantum states governed by a time-dependent Hamiltonian, and high- and low-frequency noise mitigation. Numerical simulations indicate that our quantum recurrent neural networks exhibit a striking ability to generalise from small training sets.
    Augmenting a Physics-Informed Neural Network for the 2D Burgers Equation by Addition of Solution Data Points. (arXiv:2301.07824v1 [physics.flu-dyn])
    We implement a Physics-Informed Neural Network (PINN) for solving the two-dimensional Burgers equations. This type of model can be trained with no previous knowledge of the solution; instead, it relies on evaluating the governing equations of the system in points of the physical domain. It is also possible to use points with a known solution during training. In this paper, we compare PINNs trained with different amounts of governing equation evaluation points and known solution points. Comparing models that were trained purely with known solution points to those that have also used the governing equations, we observe an improvement in the overall observance of the underlying physics in the latter. We also investigate how changing the number of each type of point affects the resulting models differently. Finally, we argue that the addition of the governing equations during training may provide a way to improve the overall performance of the model without relying on additional data, which is especially important for situations where the number of known solution points is limited.
    Improving Machine Translation with Phrase Pair Injection and Corpus Filtering. (arXiv:2301.08008v1 [cs.CL])
    In this paper, we show that the combination of Phrase Pair Injection and Corpus Filtering boosts the performance of Neural Machine Translation (NMT) systems. We extract parallel phrases and sentences from the pseudo-parallel corpus and augment it with the parallel corpus to train the NMT models. With the proposed approach, we observe an improvement in the Machine Translation (MT) system for 3 low-resource language pairs, Hindi-Marathi, English-Marathi, and English-Pashto, and 6 translation directions by up to 2.7 BLEU points, on the FLORES test data. These BLEU score improvements are over the models trained using the whole pseudo-parallel corpus augmented with the parallel corpus.
    WaveMix: A Resource-efficient Neural Network for Image Analysis. (arXiv:2205.14375v3 [cs.CV] UPDATED)
    To allow image analysis in resource-constrained scenarios without compromising generalizability, we introduce WaveMix -- a novel and flexible neural framework that reduces the GPU RAM (memory) and compute (latency) compared to CNNs and transformers. In addition to using convolutional layers that exploit shift-invariant image statistics, the proposed framework uses multi-level two-dimensional discrete wavelet transform (2D-DWT) modules to exploit scale-invariance and edge sparseness, which gives it the following advantages. Firstly, the fixed weights of wavelet modules do not add to the parameter count while reorganizing information based on these image priors. Secondly, the wavelet modules scale the spatial extents of feature maps by integral powers of $\frac{1}{2}\times\frac{1}{2}$, which reduces the memory and latency required for forward and backward passes. Finally, a multi-level 2D-DWT leads to a quicker expansion of the receptive field per layer than pooling (which we do not use) and it is a more effective spatial token mixer. WaveMix also generalizes better than other token mixing models, such as ConvMixer, MLP-Mixer, PoolFormer, random filters, and Fourier basis, because the wavelet transform is much better suited for image decomposition and spatial token mixing. WaveMix is a flexible model that can perform well on multiple image tasks without needing architectural modifications. WaveMix achieves a semantic segmentation mIoU of 83% on the Cityscapes validation set outperforming transformer and CNN-based architectures. We also demonstrate the advantages of WaveMix for classification on multiple datasets and show that WaveMix establishes new state-of-the-results in Places-365, EMNIST, and iNAT-mini datasets.
    PDFormer: Propagation Delay-aware Dynamic Long-range Transformer for Traffic Flow Prediction. (arXiv:2301.07945v1 [cs.LG])
    As a core technology of Intelligent Transportation System, traffic flow prediction has a wide range of applications. The fundamental challenge in traffic flow prediction is to effectively model the complex spatial-temporal dependencies in traffic data. Spatial-temporal Graph Neural Network (GNN) models have emerged as one of the most promising methods to solve this problem. However, GNN-based models have three major limitations for traffic prediction: i) Most methods model spatial dependencies in a static manner, which limits the ability to learn dynamic urban traffic patterns; ii) Most methods only consider short-range spatial information and are unable to capture long-range spatial dependencies; iii) These methods ignore the fact that the propagation of traffic conditions between locations has a time delay in traffic systems. To this end, we propose a novel Propagation Delay-aware dynamic long-range transFormer, namely PDFormer, for accurate traffic flow prediction. Specifically, we design a spatial self-attention module to capture the dynamic spatial dependencies. Then, two graph masking matrices are introduced to highlight spatial dependencies from short- and long-range views. Moreover, a traffic delay-aware feature transformation module is proposed to empower PDFormer with the capability of explicitly modeling the time delay of spatial information propagation. Extensive experimental results on six real-world public traffic datasets show that our method can not only achieve state-of-the-art performance but also exhibit competitive computational efficiency. Moreover, we visualize the learned spatial-temporal attention map to make our model highly interpretable.
    Learning Generalizable Models for Vehicle Routing Problems via Knowledge Distillation. (arXiv:2210.07686v2 [cs.LG] UPDATED)
    Recent neural methods for vehicle routing problems always train and test the deep models on the same instance distribution (i.e., uniform). To tackle the consequent cross-distribution generalization concerns, we bring the knowledge distillation to this field and propose an Adaptive Multi-Distribution Knowledge Distillation (AMDKD) scheme for learning more generalizable deep models. Particularly, our AMDKD leverages various knowledge from multiple teachers trained on exemplar distributions to yield a light-weight yet generalist student model. Meanwhile, we equip AMDKD with an adaptive strategy that allows the student to concentrate on difficult distributions, so as to absorb hard-to-master knowledge more effectively. Extensive experimental results show that, compared with the baseline neural methods, our AMDKD is able to achieve competitive results on both unseen in-distribution and out-of-distribution instances, which are either randomly synthesized or adopted from benchmark datasets (i.e., TSPLIB and CVRPLIB). Notably, our AMDKD is generic, and consumes less computational resources for inference.
    An SDE for Modeling SAM: Theory and Insights. (arXiv:2301.08203v1 [cs.LG])
    We study the SAM (Sharpness-Aware Minimization) optimizer which has recently attracted a lot of interest due to its increased performance over more classical variants of stochastic gradient descent. Our main contribution is the derivation of continuous-time models (in the form of SDEs) for SAM and its unnormalized variant USAM, both for the full-batch and mini-batch settings. We demonstrate that these SDEs are rigorous approximations of the real discrete-time algorithms (in a weak sense, scaling linearly with the step size). Using these models, we then offer an explanation of why SAM prefers flat minima over sharp ones - by showing that it minimizes an implicitly regularized loss with a Hessian-dependent noise structure. Finally, we prove that perhaps unexpectedly SAM is attracted to saddle points under some realistic conditions. Our theoretical results are supported by detailed experiments.
    Fully Elman Neural Network: A Novel Deep Recurrent Neural Network Optimized by an Improved Harris Hawks Algorithm for Classification of Pulmonary Arterial Wedge Pressure. (arXiv:2301.07710v1 [cs.LG])
    Heart failure (HF) is one of the most prevalent life-threatening cardiovascular diseases in which 6.5 million people are suffering in the USA and more than 23 million worldwide. Mechanical circulatory support of HF patients can be achieved by implanting a left ventricular assist device (LVAD) into HF patients as a bridge to transplant, recovery or destination therapy and can be controlled by measurement of normal and abnormal pulmonary arterial wedge pressure (PAWP). While there are no commercial long-term implantable pressure sensors to measure PAWP, real-time non-invasive estimation of abnormal and normal PAWP becomes vital. In this work, first an improved Harris Hawks optimizer algorithm called HHO+ is presented and tested on 24 unimodal and multimodal benchmark functions. Second, a novel fully Elman neural network (FENN) is proposed to improve the classification performance. Finally, four novel 18-layer deep learning methods of convolutional neural networks (CNNs) with multi-layer perceptron (CNN-MLP), CNN with Elman neural networks (CNN-ENN), CNN with fully Elman neural networks (CNN-FENN), and CNN with fully Elman neural networks optimized by HHO+ algorithm (CNN-FENN-HHO+) for classification of abnormal and normal PAWP using estimated HVAD pump flow were developed and compared. The estimated pump flow was derived by a non-invasive method embedded into the commercial HVAD controller. The proposed methods are evaluated on an imbalanced clinical dataset using 5-fold cross-validation. The proposed CNN-FENN-HHO+ method outperforms the proposed CNN-MLP, CNN-ENN and CNN-FENN methods and improved the classification performance metrics across 5-fold cross-validation. The proposed methods can reduce the likelihood of hazardous events like pulmonary congestion and ventricular suction for HF patients and notify identified abnormal cases to the hospital, clinician and cardiologist.
    Skeleton Clustering: Dimension-Free Density-based Clustering. (arXiv:2104.10770v2 [stat.ML] UPDATED)
    We introduce a density-based clustering method called skeleton clustering that can detect clusters in multivariate and even high-dimensional data with irregular shapes. To bypass the curse of dimensionality, we propose surrogate density measures that are less dependent on the dimension but have intuitive geometric interpretations. The clustering framework constructs a concise representation of the given data as an intermediate step and can be thought of as a combination of prototype methods, density-based clustering, and hierarchical clustering. We show by theoretical analysis and empirical studies that the skeleton clustering leads to reliable clusters in multivariate and high-dimensional scenarios.
    Emergence of the SVD as an interpretable factorization in deep learning for inverse problems. (arXiv:2301.07820v1 [cs.LG])
    We demonstrate the emergence of weight matrix singular value decomposition (SVD) in interpreting neural networks (NNs) for parameter estimation from noisy signals. The SVD appears naturally as a consequence of initial application of a descrambling transform - a recently-developed technique for addressing interpretability in NNs \cite{amey2021neural}. We find that within the class of noisy parameter estimation problems, the SVD may be the means by which networks memorize the signal model. We substantiate our theoretical findings with empirical evidence from both linear and non-linear settings. Our results also illuminate the connections between a mathematical theory of semantic development \cite{saxe2019mathematical} and neural network interpretability.
    Spatio-temporal neural structural causal models for bike flow prediction. (arXiv:2301.07843v1 [cs.LG])
    As a representative of public transportation, the fundamental issue of managing bike-sharing systems is bike flow prediction. Recent methods overemphasize the spatio-temporal correlations in the data, ignoring the effects of contextual conditions on the transportation system and the inter-regional timevarying causality. In addition, due to the disturbance of incomplete observations in the data, random contextual conditions lead to spurious correlations between data and features, making the prediction of the model ineffective in special scenarios. To overcome this issue, we propose a Spatio-temporal Neural Structure Causal Model(STNSCM) from the perspective of causality. First, we build a causal graph to describe the traffic prediction, and further analyze the causal relationship between the input data, contextual conditions, spatiotemporal states, and prediction results. Second, we propose to apply the frontdoor criterion to eliminate confounding biases in the feature extraction process. Finally, we propose a counterfactual representation reasoning module to extrapolate the spatio-temporal state under the factual scenario to future counterfactual scenarios to improve the prediction performance. Experiments on real-world datasets demonstrate the superior performance of our model, especially its resistance to fluctuations caused by the external environment. The source code and data will be released.
  • Open

    Global Nash Equilibrium in Non-convex Multi-player Game: Theory and Algorithms. (arXiv:2301.08015v1 [cs.GT])
    Wide machine learning tasks can be formulated as non-convex multi-player games, where Nash equilibrium (NE) is an acceptable solution to all players, since no one can benefit from changing its strategy unilaterally. Attributed to the non-convexity, obtaining the existence condition of global NE is challenging, let alone designing theoretically guaranteed realization algorithms. This paper takes conjugate transformation to the formulation of non-convex multi-player games, and casts the complementary problem into a variational inequality (VI) problem with a continuous pseudo-gradient mapping. We then prove the existence condition of global NE: the solution to the VI problem satisfies a duality relation. Based on this VI formulation, we design a conjugate-based ordinary differential equation (ODE) to approach global NE, which is proved to have an exponential convergence rate. To make the dynamics more implementable, we further derive a discretized algorithm. We apply our algorithm to two typical scenarios: multi-player generalized monotone game and multi-player potential game. In the two settings, we prove that the step-size setting is required to be $\mathcal{O}(1/k)$ and $\mathcal{O}(1/\sqrt k)$ to yield the convergence rates of $\mathcal{O}(1/ k)$ and $\mathcal{O}(1/\sqrt k)$, respectively. Extensive experiments in robust neural network training and sensor localization are in full agreement with our theory.
    Skeleton Clustering: Dimension-Free Density-based Clustering. (arXiv:2104.10770v2 [stat.ML] UPDATED)
    We introduce a density-based clustering method called skeleton clustering that can detect clusters in multivariate and even high-dimensional data with irregular shapes. To bypass the curse of dimensionality, we propose surrogate density measures that are less dependent on the dimension but have intuitive geometric interpretations. The clustering framework constructs a concise representation of the given data as an intermediate step and can be thought of as a combination of prototype methods, density-based clustering, and hierarchical clustering. We show by theoretical analysis and empirical studies that the skeleton clustering leads to reliable clusters in multivariate and high-dimensional scenarios.
    Shapley Values with Uncertain Value Functions. (arXiv:2301.08086v1 [cs.LG])
    We propose a novel definition of Shapley values with uncertain value functions based on first principles using probability theory. Such uncertain value functions can arise in the context of explainable machine learning as a result of non-deterministic algorithms. We show that random effects can in fact be absorbed into a Shapley value with a noiseless but shifted value function. Hence, Shapley values with uncertain value functions can be used in analogy to regular Shapley values. However, their reliable evaluation typically requires more computational effort.
    Score-based Causal Representation Learning with Interventions. (arXiv:2301.08230v1 [stat.ML])
    This paper studies causal representation learning problem when the latent causal variables are observed indirectly through an unknown linear transformation. The objectives are: (i) recovering the unknown linear transformation (up to scaling and ordering), and (ii) determining the directed acyclic graph (DAG) underlying the latent variables. Since identifiable representation learning is impossible based on only observational data, this paper uses both observational and interventional data. The interventional data is generated under distinct single-node randomized hard and soft interventions. These interventions are assumed to cover all nodes in the latent space. It is established that the latent DAG structure can be recovered under soft randomized interventions via the following two steps. First, a set of transformation candidates is formed by including all inverting transformations corresponding to which the \emph{score} function of the transformed variables has the minimal number of coordinates that change between an interventional and the observational environment summed over all pairs. Subsequently, this set is distilled using a simple constraint to recover the latent DAG structure. For the special case of hard randomized interventions, with an additional hypothesis testing step, one can also uniquely recover the linear transformation, up to scaling and a valid causal ordering. These results generalize the recent results that either assume deterministic hard interventions or linear causal relationships in the latent space.
    Learning to Rank by Causal Effects Without Data to Accurately Estimate Causal Effects. (arXiv:2206.12532v2 [stat.ML] UPDATED)
    Decision makers often want to identify the individuals for whom some intervention or treatment will be most effective in order to decide who to treat. In such cases, decision makers would ideally like to rank potential recipients of the treatment according to their individual causal effects. However, the available data may be completely inadequate to estimate causal effects accurately. We formalize a new assumption -- the rank preservation assumption (RPA) -- that defines when data are suitable to learn how to rank individuals according to their causal effects, even if the effects themselves cannot be accurately estimated. The RPA holds when there is data to estimate a scoring variable that induces the same ranking of individuals as the causal effect of interest. Some of the scoring variables we consider are confounded estimates, proxy causal effects, and non-causal quantities. We show that such scoring variables can work well for treatment assignment if the RPA is met, and potentially even better than using causal effects as scores. We also show that the RPA holds under conditions that are more general and weaker than the typical assumptions made in observational studies. Finally, we showcase how practitioners can apply and evaluate alternative scoring models (including non-causal models) to maximize the causal impact of their targeting decisions.
    Everything is Connected: Graph Neural Networks. (arXiv:2301.08210v1 [cs.LG])
    In many ways, graphs are the main modality of data we receive from nature. This is due to the fact that most of the patterns we see, both in natural and artificial systems, are elegantly representable using the language of graph structures. Prominent examples include molecules (represented as graphs of atoms and bonds), social networks and transportation networks. This potential has already been seen by key scientific and industrial groups, with already-impacted application areas including traffic forecasting, drug discovery, social network analysis and recommender systems. Further, some of the most successful domains of application for machine learning in previous years -- images, text and speech processing -- can be seen as special cases of graph representation learning, and consequently there has been significant exchange of information between these areas. The main aim of this short survey is to enable the reader to assimilate the key concepts in the area, and position graph representation learning in a proper context with related fields.
    Diffusion-based Conditional ECG Generation with Structured State Space Models. (arXiv:2301.08227v1 [eess.SP])
    Synthetic data generation is a promising solution to address privacy issues with the distribution of sensitive health data. Recently, diffusion models have set new standards for generative models for different data modalities. Also very recently, structured state space models emerged as a powerful modeling paradigm to capture long-term dependencies in time series. We put forward SSSD-ECG, as the combination of these two technologies, for the generation of synthetic 12-lead electrocardiograms conditioned on more than 70 ECG statements. Due to a lack of reliable baselines, we also propose conditional variants of two state-of-the-art unconditional generative models. We thoroughly evaluate the quality of the generated samples, by evaluating pretrained classifiers on the generated data and by evaluating the performance of a classifier trained only on synthetic data, where SSSD-ECG clearly outperforms its GAN-based competitors. We demonstrate the soundness of our approach through further experiments, including conditional class interpolation and a clinical Turing test demonstrating the high quality of the SSSD-ECG samples across a wide range of conditions.
    Equivalence relations and $L^p$ distances between time series with application to the Black Summer Australian bushfires. (arXiv:2002.02592v2 [stat.ME] UPDATED)
    This paper introduces a new framework of algebraic equivalence relations between time series and new distance metrics between them, then applies these to investigate the Australian ``Black Summer'' bushfire season of 2019-2020. First, we introduce a general framework for defining equivalence between time series, heuristically intended to be equivalent if they differ only up to noise. Our first specific implementation is based on using change point algorithms and comparing statistical quantities such as mean or variance in stationary segments. We thus derive the existence of such equivalence relations on the space of time series, such that the quotient spaces can be equipped with a metrizable topology. Next, we illustrate specifically how to define and compute such distances among a collection of time series and perform clustering and additional analysis thereon. Then, we apply these insights to analyze air quality data across New South Wales, Australia, during the 2019-2020 bushfires. There, we investigate structural similarity with respect to this data and identify locations that were impacted anonymously by the fires relative to their location. This may have implications regarding the appropriate management of resources to avoid gaps in the defense against future fires.
    Semiparametric inference using fractional posteriors. (arXiv:2301.08158v1 [math.ST])
    We establish a general Bernstein--von Mises theorem for approximately linear semiparametric functionals of fractional posterior distributions based on nonparametric priors. This is illustrated in a number of nonparametric settings and for different classes of prior distributions, including Gaussian process priors. We show that fractional posterior credible sets can provide reliable semiparametric uncertainty quantification, but have inflated size. To remedy this, we further propose a \textit{shifted-and-rescaled} fractional posterior set that is an efficient confidence set having optimal size under regularity conditions. As part of our proofs, we also refine existing contraction rate results for fractional posteriors by sharpening the dependence of the rate on the fractional exponent.
    On Measuring Excess Capacity in Neural Networks. (arXiv:2202.08070v3 [cs.LG] UPDATED)
    We study the excess capacity of deep networks in the context of supervised classification. That is, given a capacity measure of the underlying hypothesis class - in our case, empirical Rademacher complexity - to what extent can we (a priori) constrain this class while retaining an empirical error on a par with the unconstrained regime? To assess excess capacity in modern architectures (such as residual networks), we extend and unify prior Rademacher complexity bounds to accommodate function composition and addition, as well as the structure of convolutions. The capacity-driving terms in our bounds are the Lipschitz constants of the layers and an (2, 1) group norm distance to the initializations of the convolution weights. Experiments on benchmark datasets of varying task difficulty indicate that (1) there is a substantial amount of excess capacity per task, and (2) capacity can be kept at a surprisingly similar level across tasks. Overall, this suggests a notion of compressibility with respect to weight norms, complementary to classic compression via weight pruning. Source code is available at https://github.com/rkwitt/excess_capacity.  ( 2 min )
    Tight Guarantees for Interactive Decision Making with the Decision-Estimation Coefficient. (arXiv:2301.08215v1 [cs.LG])
    A foundational problem in reinforcement learning and interactive decision making is to understand what modeling assumptions lead to sample-efficient learning guarantees, and what algorithm design principles achieve optimal sample complexity. Recently, Foster et al. (2021) introduced the Decision-Estimation Coefficient (DEC), a measure of statistical complexity which leads to upper and lower bounds on the optimal sample complexity for a general class of problems encompassing bandits and reinforcement learning with function approximation. In this paper, we introduce a new variant of the DEC, the Constrained Decision-Estimation Coefficient, and use it to derive new lower bounds that improve upon prior work on three fronts: - They hold in expectation, with no restrictions on the class of algorithms under consideration. - They hold globally, and do not rely on the notion of localization used by Foster et al. (2021). - Most interestingly, they allow the reference model with respect to which the DEC is defined to be improper, establishing that improper reference models play a fundamental role. We provide upper bounds on regret that scale with the same quantity, thereby closing all but one of the gaps between upper and lower bounds in Foster et al. (2021). Our results apply to both the regret framework and PAC framework, and make use of several new analysis and algorithm design techniques that we anticipate will find broader use.  ( 2 min )
    Differentially Private Online Bayesian Estimation With Adaptive Truncation. (arXiv:2301.08202v1 [cs.LG])
    We propose a novel online and adaptive truncation method for differentially private Bayesian online estimation of a static parameter regarding a population. We assume that sensitive information from individuals is collected sequentially and the inferential aim is to estimate, on-the-fly, a static parameter regarding the population to which those individuals belong. We propose sequential Monte Carlo to perform online Bayesian estimation. When individuals provide sensitive information in response to a query, it is necessary to perturb it with privacy-preserving noise to ensure the privacy of those individuals. The amount of perturbation is proportional to the sensitivity of the query, which is determined usually by the range of the queried information. The truncation technique we propose adapts to the previously collected observations to adjust the query range for the next individual. The idea is that, based on previous observations, we can carefully arrange the interval into which the next individual's information is to be truncated before being perturbed with privacy-preserving noise. In this way, we aim to design predictive queries with small sensitivity, hence small privacy-preserving noise, enabling more accurate estimation while maintaining the same level of privacy. To decide on the location and the width of the interval, we use an exploration-exploitation approach a la Thompson sampling with an objective function based on the Fisher information of the generated observation. We show the merits of our methodology with numerical examples.  ( 2 min )
    A Multi-Resolution Framework for U-Nets with Applications to Hierarchical VAEs. (arXiv:2301.08187v1 [stat.ML])
    U-Net architectures are ubiquitous in state-of-the-art deep learning, however their regularisation properties and relationship to wavelets are understudied. In this paper, we formulate a multi-resolution framework which identifies U-Nets as finite-dimensional truncations of models on an infinite-dimensional function space. We provide theoretical results which prove that average pooling corresponds to projection within the space of square-integrable functions and show that U-Nets with average pooling implicitly learn a Haar wavelet basis representation of the data. We then leverage our framework to identify state-of-the-art hierarchical VAEs (HVAEs), which have a U-Net architecture, as a type of two-step forward Euler discretisation of multi-resolution diffusion processes which flow from a point mass, introducing sampling instabilities. We also demonstrate that HVAEs learn a representation of time which allows for improved parameter efficiency through weight-sharing. We use this observation to achieve state-of-the-art HVAE performance with half the number of parameters of existing models, exploiting the properties of our continuous-time formulation.  ( 2 min )
    Robust Gaussian Process Regression with Huber Likelihood. (arXiv:2301.07858v1 [stat.AP])
    Gaussian process regression in its most simplified form assumes normal homoscedastic noise and utilizes analytically tractable mean and covariance functions of predictive posterior distribution using Gaussian conditioning. Its hyperparameters are estimated by maximizing the evidence, commonly known as type II maximum likelihood estimation. Unfortunately, Bayesian inference based on Gaussian likelihood is not robust to outliers, which are often present in the observational training data sets. To overcome this problem, we propose a robust process model in the Gaussian process framework with the likelihood of observed data expressed as the Huber probability distribution. The proposed model employs weights based on projection statistics to scale residuals and bound the influence of vertical outliers and bad leverage points on the latent functions estimates while exhibiting a high statistical efficiency at the Gaussian and thick tailed noise distributions. The proposed method is demonstrated by two real world problems and two numerical examples using datasets with additive errors following thick tailed distributions such as Students t, Laplace, and Cauchy distribution.  ( 2 min )
    Kinetic Langevin MCMC Sampling Without Gradient Lipschitz Continuity -- the Strongly Convex Case. (arXiv:2301.08039v1 [math.PR])
    In this article we consider sampling from log concave distributions in Hamiltonian setting, without assuming that the objective gradient is globally Lipschitz. We propose two algorithms based on monotone polygonal (tamed) Euler schemes, to sample from a target measure, and provide non-asymptotic 2-Wasserstein distance bounds between the law of the process of each algorithm and the target measure. Finally, we apply these results to bound the excess risk optimization error of the associated optimization problem.  ( 2 min )
    Learning-Rate-Free Learning by D-Adaptation. (arXiv:2301.07733v1 [cs.LG])
    The speed of gradient descent for convex Lipschitz functions is highly dependent on the choice of learning rate. Setting the learning rate to achieve the optimal convergence rate requires knowing the distance D from the initial point to the solution set. In this work, we describe a single-loop method, with no back-tracking or line searches, which does not require knowledge of $D$ yet asymptotically achieves the optimal rate of convergence for the complexity class of convex Lipschitz functions. Our approach is the first parameter-free method for this class without additional multiplicative log factors in the convergence rate. We present extensive experiments for SGD and Adam variants of our method, where the method automatically matches hand-tuned learning rates across more than a dozen diverse machine learning problems, including large-scale vision and language problems. Our method is practical, efficient and requires no additional function value or gradient evaluations each step. An open-source implementation is available (https://github.com/facebookresearch/dadaptation).  ( 2 min )
    Rates of convergence for density estimation with generative adversarial networks. (arXiv:2102.00199v3 [math.ST] UPDATED)
    In this work we undertake a thorough study of the non-asymptotic properties of the vanilla generative adversarial networks (GANs). We prove a sharp oracle inequality for the Jensen-Shannon (JS) divergence between the underlying density $\mathsf{p}^*$ and the GAN estimate. We also study the rates of convergence in the context of nonparametric density estimation. In particular, we show that the JS-divergence between the GAN estimate and $\mathsf{p}^*$ decays as fast as $(\log{n}/n)^{2\beta/(2\beta+d)}$ where $n$ is the sample size and $\beta$ determines the smoothness of $\mathsf{p}^*$. To the best of our knowledge, this is the first result in the literature on density estimation using vanilla GANs with JS convergence rates faster than $n^{-1/2}$ in the regime $\beta > d/2$. Moreover, we show that the obtained rate is minimax optimal (up to logarithmic factors) for the considered class of densities.  ( 2 min )
    Catapult Dynamics and Phase Transitions in Quadratic Nets. (arXiv:2301.07737v1 [cs.LG])
    Neural networks trained with gradient descent can undergo non-trivial phase transitions as a function of the learning rate. In (Lewkowycz et al., 2020) it was discovered that wide neural nets can exhibit a catapult phase for super-critical learning rates, where the training loss grows exponentially quickly at early times before rapidly decreasing to a small value. During this phase the top eigenvalue of the neural tangent kernel (NTK) also undergoes significant evolution. In this work, we will prove that the catapult phase exists in a large class of models, including quadratic models and two-layer, homogenous neural nets. To do this, we show that for a certain range of learning rates the weight norm decreases whenever the loss becomes large. We also empirically study learning rates beyond this theoretically derived range and show that the activation map of ReLU nets trained with super-critical learning rates becomes increasingly sparse as we increase the learning rate.  ( 2 min )
    Understanding the diffusion models by conditional expectations. (arXiv:2301.07882v1 [cs.LG])
    This paper provide several mathematical analyses of the diffusion model in machine learning. The drift term of the backwards sampling process is represented as a conditional expectation involving the data distribution and the forward diffusion. The training process aims to find such a drift function by minimizing the mean-squared residue related to the conditional expectation. Using small-time approximations of the Green's function of the forward diffusion, we show that the analytical mean drift function in DDPM and the score function in SGM asymptotically blow up in the final stages of the sampling process for singular data distributions such as those concentrated on lower-dimensional manifolds, and is therefore difficult to approximate by a network. To overcome this difficulty, we derive a new target function and associated loss, which remains bounded even for singular data distributions. We illustrate the theoretical findings with several numerical examples.  ( 2 min )
  • Open

    measure for difference between two distribution
    Hi, I'm looking for a metric that will describe the difference between 2 distributions. This is being used for classification/model selection, to pick a model that is most similar to a theoretical distribution. The distributions are empiric, that is from a non parametric bootstrap and a simulation. The differences in the distributions could be in the mean or the variance (but neither need to be normal). I've looked at Kullback-Lieber, but that is intended for a cumulative probability distribution, and that pretty much removes the mean effect since the sum of probabilities must = 1. Had some success with the kolmogorov smirnov distance, which seems useful, as well as the jensen shannon divergence and the Wasserstein distance. The Jeffreys divergence really seems like what I want, but doesn't see to exist for numerical/empiric distributions in 1 dimension. It seems that most of the metrics for differences in distributions are for probability distributions. My distributions are not that, they may, for example have a range of 1000-4000, not 0-1. Any other ideas? thanks Mark submitted by /u/marksale11 [link] [comments]  ( 41 min )

  • Open

    [D] Object detection or image classification? Training a model to recognize playing cards
    Hi all, I have been experimenting with object detection recently, using Faster R-CNN and YOLOv7 to train models on pre-existing datasets. Using a UNO card dataset I was able to quite accurately detect the type of UNO cards, based on the symbol in the top left corner. I used an object detection approach, with UNO cards only being categorized into 14 classes. Based on that, I am wondering what the best approach would be to enhance the model to use for other and more comprehensive card games. Thinking of card games like Munchkin for example, which has 1000s of different cards. For card games like this, object detection might not be the best approach having 1000s of different classes to consider. ​ The two different approaches I am considering: Using object detection, create as many classes as there are different playing cards in the game, training the model to detect every single card individually or Using object detection, use playing cards to train the model to detect the playing card itself, then use the detected playing card as input for an image classification algorithm ​ For me there are pros and cons to both methods: The first approach might be much more accurate, as it detects each card individually. On the other hand, it seems to me that it needs considerably more classes and data to feed into those classes. It also might be difficult to expand the model with more unique cards, as you would have to rerun the model every time. The second approach might not be as accurate, as it might not only detect playing cards but also identify other objects as playing cards. On the flip side, it seems to me that it is much easier to expand the model with more unique cards. ​ What might be the best approach here? Do you have a different approach to this, which might be more efficient? submitted by /u/Pallemann [link] [comments]  ( 43 min )
    [D] Speech enhancement - like Adobe Enhance/Audo Studio
    Does anyone here know how Audo / Adobe Enhance work under the hood? Just wondering what open-source tooling already exists of similar quality, likewise with data, and architectures? Would anyone be interested in whipping up something open-source that can be self-hosted? submitted by /u/NegotiationUpbeat545 [link] [comments]  ( 42 min )
    [D] Generate data that is not a dataset
    Hey everyone, I'm currently faced with the challenge of having to generate data that is deliberately not in a dataset. So if you think about the dataset as a distribution, the data points should have a possibly low probability. Additionally, each data point is a 30 dimensional vector and I know the min and max values for each dimension. How do I do that? What kinds of algorithms could I use for that? Can I somehow fit a distribution and sample low probability data points from it? Or a GAN for generating? Or are there obvious classical ML or statistical methods for that? submitted by /u/NiconiusX [link] [comments]  ( 42 min )
    [D] Not sure if time series or multiple classifications?
    I am beginning a problem similar to the one bellow for my work. There is a score 1-4 (1 is bad, 4 is very good) of a persons back sprain recovery. The data we have are back sprain recovery scores recorded after two weeks, 3 months and 6 months, along with information (features) about their behavior like sleep, medications, diet, and exercise. We want to predict there 2 week, 3 month, and 6 month back sprain recovery scores based on their initial behavior inputs. For example, given a user sleeps 8 hours a day, consumes x amount of sugar, does physical therapy 4 days a week, and takes x medication, what will there recovery scores be at 2 weeks, 3 months and 6 months? The training data would look like: ​ Sleep Average Medication Days of Physical Therapy Diet Week 2 recovery score Month 3 recovery score Month 6 recovery score 9 hours per night Advil 4 days/ week Healthy 2 3 4 5 hours per night None 0 days/week Unhealthy 1 2 2 ​ I want a model (or multiple models) to predict 3 values which is the 2 week, 3 month, and 6 month scores. I am not familiar with time series, but it seems like the data may be too sparse. Should I be using time series here, or should I create 3 classification models? submitted by /u/spiritualquestions [link] [comments]  ( 43 min )
    [R] Is there a way to combine a knowledge graph and other types of data for ML purposes?
    Hello, I really don't know how to frame this question but I wanted to ask if the was a way to integrate the relationships and nodes of a knowledge graph with recorded data. Like for example, when a knowledge graph contains information about relationships between features, can it be integrated with a dataset containing recorded or measured quantities of those features. The goal of this is to "infuse" the recorded dataset with relationships already known in the knowledge graph for some data analysis purpose. I know it sounds confusing but you can as for clarification on some details. Please help. submitted by /u/Low-Mood3229 [link] [comments]  ( 43 min )
    [D] Discrete vs. Continuous Normalizing Flows
    I'm working on developing methods for some density estimation and inverse modeling tasks on physics simulation data, and normalizing flow methods seem to be a pretty good tool for this job. I'm right now looking to implement a few different model flavors a la INNs and OT-Flow, and am interested in hearing some perspectives from people in the community who have worked with these kinds of models. What would you consider the current state of the art in normalizing flow methods? Most of what I'm finding in the discrete space seems to have converged on flavors of RealNVP, while OT-Flow seems to be the most advanced in the continuous space. Beyond the benchmark performance metrics tabulated in the literature, what can we say about when to prefer continuous vs. discrete models? It's obviously going to be problem-dependent to some degree, but are there general heuristics to be aware of here? For continuous models (and implicit layer methods more generally), where are the research threads currently at improving runtime performance? The 2020 NeurIPS tutorial on implicit layers (link) has been helpful, but it would be interesting to know how things have advanced since then. Any and all insights would be appreciated! submitted by /u/nuclear_knucklehead [link] [comments]  ( 43 min )
    [N] ESANN 2023 | Special Session on Neuro-Symbolic AI (CFP)
    Neuro-symbolic AI is a promising approach to artificial intelligence that aims to combine the strengths of symbolic reasoning and probabilistic systems. For example, combining inductive logic programming and deep learning with applications in graphs, vision, reasoning and explainability. In this special session, we will provide an overview of neuro-symbolic AI, key concepts, and current state-of-the-art techniques. We will also discuss the potential benefits and challenges of neuro-symbolic AI and its potential impact on various fields and applications. In addition to the tutorial, we welcome contributions from attendees in the context of neuro-symbolic AI. This includes but is not limited to: • Novel neuro-symbolic models and techniques • Applications of neuro-symbolic AI to real-world problems • Empirical evaluations and comparisons of neuro-symbolic AI approaches • Theoretical foundations and analysis of neuro-symbolic AI • Emerging trends and challenges in the field of neuro-symbolic AI ​ Submission guidelines: https://www.esann.org/node/6 Paper submission deadline: 2 May 2023 Conference date: 4-6 October 2023 Conference location: Crowne Plaza hotel Bruges, Belgium submitted by /u/iav_tf_h [link] [comments]  ( 42 min )
    [D] Computationally light-weight deep learning research topics?
    Hello, I am familiar with the theory of differentiable computing and SGD training having done some research work on semi-supervised learning for image classification and semantic/panoptic segmentation. In other words, I am familiar with understanding implementing state-of-the-art proposals as well as tweaking them. Now I'm an unemployed and interested in conducting some research using PyTorch and Google Colab which seems feasible only if the problem or topic at hand is relatively low-cost. So I'm asking the question: What are some deep learning (metalearning,regularization, non-supervised training) or applied DL (CV/NLP/...) topics or datasets that are lightweight enough to be researched with just one GPU? Thanks and have a nice weekend submitted by /u/iamnotlefthanded666 [link] [comments]  ( 42 min )
    [D] "Deep Learning Tuning Playbook" (recently released by Google Brain people)
    https://github.com/google-research/tuning_playbook - Google has released a playbook (solely) about how to tune hyper-parameters of neural networks. Disclaimer: I am unrelated to this repository, just came across it and thought it is suitable for this subreddit. I have searched through and found no posts, thus I post it to hear some comments/insights from you ;) submitted by /u/fzyzcjy [link] [comments]  ( 44 min )
    [D] Did YouTube just add upscaling?
    So, these pictures below are taken from a 144p video on YouTube. You cannot tell me that these aren't CNN upscaling artefacts. So this raises the question of.... how exactly is this implemented? What model are they using which is tiny enough to run on (i assume) WebGL2? Is it a CNN inside of GLSL shaders? Is it something else? CPU side or GPU side? And also... how have I not seen a single other person pointing this out, anywhere on the internet. Believe me I looked. Ain't no one talking about this. EDIT: UPDATE this is doing it in ALL videos in chrome now. It only works in Chrome, not in Discord or Edge, so its not GPU/Windows fuckery. But the strange thing is other friends testing this with the same version of Chrome ***DONT*** have this? And the even stranger thing is... this is running on Intel Integrated Graphics... https://preview.redd.it/jnjwjzyag7da1.png?width=3240&format=png&auto=webp&s=504c9fa6ba41ae3a5b1266fe17e519839d3cf933 https://preview.redd.it/6vzyx5f1g7da1.png?width=1182&format=png&auto=webp&s=b56df5017d1fb742f042c847e42818d7f05a1888 https://preview.redd.it/bo36ko40g7da1.png?width=365&format=png&auto=webp&s=1777e238a7299084da9e10eb62c6c2539dc5cc86 https://preview.redd.it/16zpxwqyf7da1.png?width=333&format=png&auto=webp&s=38b9f2f4eb1a5999ad212aab6247e41a913a5294 submitted by /u/Avelina9X [link] [comments]  ( 46 min )
    [N] OpenAI Used Kenyan Workers on Less Than $2 Per Hour to Make ChatGPT Less Toxic
    https://time.com/6247678/openai-chatgpt-kenya-workers/ submitted by /u/ChubChubkitty [link] [comments]  ( 54 min )
    [P] paper-hero: Yet Another Paper Search Tool
    Hi guys, thanks for reading this post. I built a simplistic paper search tool that integrates ACL Anthology, arXiv API, and DBLP API. Github address: Spico197/paper-hero Motivation: I'm majoring NLP and I'd like to search for papers with "Event Extraction" as titles in specific proceedings (e.g. ACL, EMNLP). Challenge: There are lots of search tools and APIs, but few of them provide field-specific searches, like authors, titles, abstracts, and venues. Methodology: I integrate ACL Anthology, arXiv API, and DBLP API, and provide a two-stage search toolkit, which first stores target papers via the official fuzzy search API, and then matches specific fields. Advantages: This tool satisfies my need to stockpile papers and it can dump checklists in markdown format, or complete paper information in jsonl. AND and OR logics are supported in search queries. Limitations: This tool is based on simple string matching, so you have to know some terminologies in the target fields. You are warmly welcome to have a try and feel free to drop me an issue! from src.interfaces.aclanthology import AclanthologyPaperList from src.utils import dump_paper_list_to_markdown_checklist if __name__ == "__main__": # use `bash scripts/get_aclanthology.sh` to download and prepare anthology data first paper_list = AclanthologyPaperList("cache/aclanthology.json") ee_query = { "title": [ # Any of the strings below is matched ["information extraction"], ["event", "extraction"], # title must include `event` and `extraction` ["event", "argument", "extraction"], ["event", "detection"], ["event", "classification"], ["event", "tracking"], ["event", "relation", "extraction"], ], # Besides the title constraint, venue must also meet the needs "venue": [ ["acl"], ["emnlp"], ["naacl"], ["coling"], ["findings"], ["tacl"], ["cl"], ], } ee_papers = paper_list.search(ee_query) dump_paper_list_to_markdown_checklist(ee_papers, "results/ee-paper-list.md") ​ markdown checklist submitted by /u/Spico197 [link] [comments]  ( 44 min )
  • Open

    Hey anyone know where I could find a AI assisted writer with no world limit, or at least a VERY high one
    Title submitted by /u/Zan_korida [link] [comments]  ( 40 min )
    Consider this: You know how chatGPT says load failed when its giving YOU the most groundbreaking answer ever? I bet: It’s sent to OpenAI, you get “Load failed” and a less profound answer on regenerate.
    It’s the free preview. No breakthroughs for you. They collect amazing breakthroughs by the minute from humans working the AI. Just add a weight for profoundness 1-10. At 7, crash and send. User never gets it. Thoughts? submitted by /u/Overall-Importance54 [link] [comments]  ( 40 min )
    AI art - automation. A working artist's take.
    submitted by /u/WSCOKN [link] [comments]  ( 40 min )
    This website was created by an AI chatbot, and all of the content was generated by an AI image generator.
    submitted by /u/FreePixelArt [link] [comments]  ( 40 min )
    Amazon Wants To Help Community Colleges with AI
    Amazon has launched an "educator enablement" program to help instructors at community colleges, HBCUs, and other minority-serving institutions learn and teach AI. The professional development program will help college instructors gain a generalist AI skillset. Amazon will provide $1,200 and continuing education credits to 330 participants who complete one of the six boot camps being offered over the course of 2023. For colleges that don't get selected for the educator enablement cohort, Amazon plans to make curriculum materials for any interested college at no cost through Github, YouTube, and AWS Academy This is from the AI With Vibes Newsletter, read the full issue here: https://aiwithvibes.beehiiv.com/p/google-brings-in-legendary-duo-for-chatgpt-battle submitted by /u/Mk_Makanaki [link] [comments]  ( 40 min )
    "Sentient AI" - Example Of Just How Easy It Is To Prompt A Fake Sentient AI(GPT3)
    submitted by /u/TheRPGGamerMan [link] [comments]  ( 40 min )
    AI WARS: Explaining Google's painfully long, 15k-word! tome about their AI plans... or lack thereof
    We covered this in our newsletter today. Here it is verbatim-- if you find it useful, hit the link and sub: https://smokingrobot.beehiiv.com/p/ai-wars ​ Microsoft has dominated BIG TECH headlines over the last few months, thanks largely to a drumbeat of headlines involving their partner OpenAI and its world-shaping ChatGPT. So awe-striking is Microsoft's hand right now, it has made rival companies' advancements, like Apple's recently announced and insanely powerful M2 MacBook Pros, look pedestrian in comparison. But now Google has entered the chat. And by "entered the chat", we mean that CEO Sundar Pichai - Pich-AI? - released a distressingly long 15,000-word(!) treatise on its own endeavors in AI, signaling a counter attack... maybe... at some point in the future... when and if it…  ( 45 min )
    A ChatGPT software engineer in your pocket
    Having AI tools like ChatGPT is like having a personal software engineer in your pocket. But most people don't know how to craft prompts for code. Here's how you can get AI to write software for you: ​ https://preview.redd.it/tfkoz3x4v8da1.png?width=686&format=png&auto=webp&s=11a88815cb44e68fa2a5f63aa99f5867ac0aa755 submitted by /u/Imagine-your-success [link] [comments]  ( 40 min )
    DREAMBOOTH: 10 MINS TRAINING Inside Stable Diffusion!
    submitted by /u/PuppetHere [link] [comments]  ( 40 min )
    Powerful Tools to Test and Improve your Chatbot! 🔥
    submitted by /u/Marinuch [link] [comments]  ( 40 min )
    Walking simulators
    For a while Ive been seeing a lot of videos like these: https://youtu.be/wqvAconYgK0 https://youtu.be/qvpXpCvkqbc https://youtu.be/kQ2bqz3HPJE And Ive been wondering what program these might be using, and if its open to the public. If not is there any program that any of you might suggest to achieve identical if not similar results as I have a few ideas on how this could be utilised within a game engine and animation workspace Thanks submitted by /u/SyhrNewo [link] [comments]  ( 40 min )
    Google plans chatbot search engine and 20 new AI products
    submitted by /u/much_successes [link] [comments]  ( 40 min )
    🚀Online Real-Time Volumetric Nerf + Slam
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    Master Thesis about AI-designed jewelry
    Dear all! My master's thesis is coming to an end, and now I am seeking valuable insights and important data into my topic through a survey. This survey aims to understand consumer awareness and purchase intentions regarding jewelry created with the help of artificial intelligence. You can find the link here: https://ucpresearch.qualtrics.com/.../SV_ex2OALp6fPXQ8Qe I would be grateful if you participated in large numbers and provide me with valuable insights on the current topic. Thank you!! submitted by /u/Ok-Rise453 [link] [comments]  ( 40 min )
    ChatGPT Trend
    submitted by /u/Realistic-Plant3957 [link] [comments]  ( 40 min )
    Are TensorFlow and other ML frameworks worth learning in 2023?
    For some explanation, I am more familiar with PyTorch but I wanted to refresh my knowledge of machine learning and deep learning concepts. However, with the recent trend of moving away from TensorFlow and towards PyTorch, I wonder if TensorFlow and other ML frameworks are still worth learning today. I know certain algorithms are exclusively implemented in one or the other framework. I think at least TensorFlow is good since they’re well-documented and put to the test by other experts. But I’m not so sure about the case of newer custom frameworks. If I dive into them, it could be a step back for me especially after reading this article. It talks about newer ML framework launches and how a lot of people try it out at first, but then interest in said frameworks starts to decrease. I know there are a bunch of good custom frameworks out there but it might take more time for new tools to become mainstream or eventually die down. Which is why I’m afraid to use them at the moment. Let me know what you all think! Thanks! submitted by /u/ActionParticular7697 [link] [comments]  ( 41 min )
    Boston Dynamics reveals Atlas AI robot new ability to grip + autonomously manipulate objects | New 3D modeling Geocode AI creates + Edits highly realistic meshes | Breakthrough Text-To-Video "Tune a Video" uses diffusion models to output coherent video
    submitted by /u/SedatelyMake [link] [comments]  ( 40 min )
    Space Locator
    Hey all. My first post here. I’ll keep it as relevant as possible. I want to develop a AI model which will determine that if the space in a room is enough or not. Like it’ll determine whether the space in the room in enough or not. We’ll provide the pictures of the room and it’ll give us the output. I think there might be some pre-trained models out there which might be helpful. Please guide me in this regard if there are some models or where should I start. I’ll be grateful. Thank you so much in advance submitted by /u/h3artb3att [link] [comments]  ( 40 min )
    Porter Robinson music continued by OpenAI Jukebox
    submitted by /u/anoneemoosh [link] [comments]  ( 40 min )
    ChatGPT Accepted As Co-Author On Multiple Research Papers
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 40 min )
    Advancements in Natural Language Processing (NLP) and its applications in various industries
    Natural Language Processing (NLP) is a rapidly growing field within the realm of Artificial Intelligence (AI) that is revolutionizing the way we interact with machines. NLP is a branch of AI that deals with the interaction between human language and computers. It is used to analyze, understand and generate human language, and it has a wide range of applications in various industries. One of the most significant advancements in NLP is the development of deep learning algorithms, which have greatly improved the accuracy and efficiency of NLP models. These algorithms have enabled the development of more sophisticated NLP systems, such as those that can understand context and generate human-like responses. One of the most prominent applications of NLP is in the customer service industry. Com…  ( 42 min )
    TextCortex AI: AI Writing Companion - (Our free browser extension)
    submitted by /u/Ruzuyu [link] [comments]  ( 40 min )
    How do you get AI art generators to produce amazing images that look like real art? Take a text-guided diffusion model and feed it the ideal text prompt with the right keywords
    Paper: https://arxiv.org/abs/2209.11711 Abstract: Recent progress in generative models, especially in text-guided diffusion models, has enabled the production of aesthetically-pleasing imagery resembling the works of professional human artists. However, one has to carefully compose the textual description, called the prompt, and augment it with a set of clarifying keywords. Since aesthetics are challenging to evaluate computationally, human feedback is needed to determine the optimal prompt formulation and keyword combination. In this paper, we present a human-in-the-loop approach to learning the most useful combination of prompt keywords using a genetic algorithm. We also show how such an approach can improve the aesthetic appeal of images depicting the same descriptions. https://preview.redd.it/ei4wm8tv66da1.png?width=1852&format=png&auto=webp&s=e20c196fe9a543e6b4ec58b3d4f689624db7f95d submitted by /u/Ok_Mine_5742 [link] [comments]  ( 40 min )
    Created by Stable Diffusion
    submitted by /u/NorthTs [link] [comments]  ( 40 min )
  • Open

    Road Map to Machine Learning & Deep Learning
    A Good Road Map To Machine Learning enginner  ( 14 min )
    Sleep disorders: can AI and Digital Twin help?
    According to the National Sleep Foundation, it is estimated that 50–70 million adults in the United States have a sleep disorder.  ( 25 min )
    Top 10 AI Applications in HRM
    Artificial Intelligence (AI) is revolutionizing the way businesses operate, and the field of Human Resource Management (HRM) is no…  ( 7 min )
    Unlocking the Power of Time Series Forecasting: A Step-by-Step Guide with Code Examples in Python
    Time series forecasting is the process of using a model to predict future values of a time series based on its past values. Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 10 min )
  • Open

    Exposing Reliability Degradation and Mitigation in Approximate DNNs under Permanent Faults
    submitted by /u/Chipdoc [link] [comments]  ( 40 min )
    Machine vision within airports
    My first idea was to use simple object recognition on the x-ray machines in airports to detect weapons. As blunt objects are hard for humans to detect, and so are plastic weapons. Surely this could be solved with a simple object recognition algorithm. I then started researching airport security, and came across, without a doubt, the safest airport in the world, the Israeli airport. They have so many incredibly invasive processes, such as interviews with highly trained personnel, who are trained to spot a liar, and officers following you around if you are deemed "high-risk" (sidenote: they are also unapologetically racist about who they deem as high-risk). Couldn't both of those processes be automated? The interviewing could be done by a non-trained worker, then have cameras that analyse the person. And the officers following people around could also be done using CCTV and some machine vision. Or even analyse people when they're walking around I'm not saying that the Iranian airport should do this, as it would be a step-down, which they clearly will not accept. Instead could this not be done in western airports, as there have been many reports indicating the lack of success when it comes to them catching anyone? Could they implement poor versions of the Iranian airport security system? ​ P.S. I recently visited Senegal (west Africa). On the way back from Senegal to the UK I was totally unaware that I had a large water bottle in my bag, and the screening was so useless that they never found it. The guy manning the scanner was watching TikTok on his phone. When I landed I then could have gone anywhere in the world without being checked again, which goes to show how much of a waste of time the current airport security is. submitted by /u/Tom_nerd [link] [comments]  ( 41 min )
    Which gpu would be better for training? Linux, PyTorch, I have a gigabyte rtx 2070 and amp FirePro s9300 x2?
    Amd* submitted by /u/DaOnlyBaby [link] [comments]  ( 40 min )
    🚀Online Real-Time Volumetric Nerf + Slam
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    How do you get AI art generators to produce amazing images that look like real art? Take a text-guided diffusion model and feed it the ideal text prompt with the right keywords
    Paper: https://arxiv.org/abs/2209.11711 Abstract: Recent progress in generative models, especially in text-guided diffusion models, has enabled the production of aesthetically-pleasing imagery resembling the works of professional human artists. However, one has to carefully compose the textual description, called the prompt, and augment it with a set of clarifying keywords. Since aesthetics are challenging to evaluate computationally, human feedback is needed to determine the optimal prompt formulation and keyword combination. In this paper, we present a human-in-the-loop approach to learning the most useful combination of prompt keywords using a genetic algorithm. We also show how such an approach can improve the aesthetic appeal of images depicting the same descriptions. https://preview.redd.it/p47ovsfc66da1.png?width=1852&format=png&auto=webp&s=516eda2724c4f254b59109fd46a65dba8ffd79a1 submitted by /u/Ok_Mine_5742 [link] [comments]  ( 40 min )
  • Open

    What are the current state-of-the-art algorithms?
    What algorithms are currently considered state-of-the-art? I’m specifically interested in those which are off-policy as I have found DQN to be the best choice for my application so far. Some algorithms I’m considering trying are R2D2, Duelling DQN and Rainbow DQN. Are there any others that could be worth a look? submitted by /u/centripetalstranger [link] [comments]  ( 40 min )
    In RL, how does one provide a theoretical justification of why one algorithm works better than the other?
    Completely random example, let's say that your experiments consistently demonstrate that a recurrent policy (LSTM) in PPO works better than a linear policy, more specifically in one kind of environments (say, environments that require cooperation between agents). Now, how do you justify theoretically your empirical finding? In other words, how do you explain the theory behind the linear policy's limitations? submitted by /u/No_Possibility_7588 [link] [comments]  ( 43 min )
    DQN for simple toy env
    Dear RL community, ​ I'm trying to get DQN (using the stable baselines 3 implementation) to solve my toy environment. No matter how many different hyperparameter configurations I try, I can't get it to work. ​ Here are some details about the environment (with num_objects=1). - the agent is essentially controlling a gripper that moves to one of the squares, grasps the item, and then moves to the target location. - it's a grid world set on 2 levels with each level of size 3x3 - the available actions are one for each square (the agent is teleported there) and 2 more for grasping and releasing. total of 20 discrete actions. - the reward provides a signal as the distance between the object and the target goal, a big reward if succeeded or -1 when it does something bad. ​ Here's my training code. I've also added a human_step function that can be used to control the gripper yourself. ​ Do you have any insights as to why it's not working? Please ask if you need more details about any part of the implementation. ​ Many thanks! submitted by /u/Mr_Physic13 [link] [comments]  ( 41 min )
    How to proceed scientifically when your hypothesis is falsified?
    I predicted that a certain change in the architecture of my agents would boost their coordination (in the context of multi-agent reinforcement learning). However, I tested this in the Meetup environment and it is not working, in the sense that it performs slightly worse than the baseline. This is how the environment works: three agents must collectively choose one of K landmarks and congregate near it. At each time step, each agent receives reward equal to the change in distance between itself and the landmark closest to all three agents. The goal landmark changes depending on the current position of all agents. When all K agents are adjacent to the same landmark, the agents receive a bonus of 1 and the episode ends. Scientifically speaking, how can I be rigorous about testing this hypothesis again? A few ideas: 1 Repeat the experiment multiple times with different random seeds to ensure that the results are robust and not influenced by random variations. 2 Vary the parameters of the agent Vary the number of modules used in the policy and test the effect on coordination. Increase the number of agents 3 Vary the parameters of the environment Changing the number of landmarks Adding distractors 4 Test another environment What do you think? - submitted by /u/No_Possibility_7588 [link] [comments]  ( 42 min )
    Continuous action space that should often return an exact value that's inputted (attempt with reinforce algorithm)
    I was wondering if this is bad practice or bound to fail. I'm in an environment where the state is x1 plus a bunch of other things call them x2. Sometimes the best action is x1 (and exactly x1). Sometimes the best action is a function of x2 (and x1). What I was thinking (and am trying to so far little success) is to have the reinforce algorithm output two gaussian action values. The first, a1, is f(x1,x2). The second, a2, is put into a sigmoid function and if it's greater than 0.5, use action f(x2,x1) and otherwise x1. It seems weird to me that in the reinforce algorithm I'd still use the log probability of both action values if a2<0.5 in which case a1 doesn't get used at all. In that case should I find the probability only of a2 and use that instead? If this idea is completely off base and you have suggestions please lmk. submitted by /u/JustTaxLandLol [link] [comments]  ( 41 min )
    Environment for General AI using Reinforcement Learning?
    I have many custom environments with these features: Observation, Valid Actions at a specific Observation, Sparse reward when the episode is terminated, either 1 or 0 I want to build a Reinforcement Learning Agent that can perform well in these environments. I started this personal project 2 years ago and it get harder the more I try to do it. Is there any other environments like this, or what paper can I read more about this kind of Reinforcement Learning ? submitted by /u/Open_Ranger4375 [link] [comments]  ( 41 min )
    agent not learning using dqn.
    Hello forum, I am trying to get a single joint actuated link to stand upright. I am using dqn as a method for the agent to learn. I have tried using different inputs and outputs with the neural net but is still failing to learn. Can you take a look at my code? #!/usr/bin/env python3 from interbotix_xs_modules.arm import InterbotixManipulatorXS import rospy import time import numpy as np import matplotlib.pyplot as plt import torch import torch.nn as nn import torch.optim as optim import math import random import matplotlib.pyplot as plt ​ from std_msgs.msg import Float64 from gazebo_msgs.msg import LinkStates from geometry_msgs.msg import Pose, Twist from std_srvs.srv import Empty from sensor_msgs.msg import JointState ​ bot = InterbotixManipulatorXS(robot_model="rx15…  ( 44 min )
  • Open

    ­­How CCC Intelligent Solutions created a custom approach for hosting complex AI models using Amazon SageMaker
    This post is co-written by Christopher Diaz, Sam Kinard, Jaime Hidalgo and Daniel Suarez  from CCC Intelligent Solutions. In this post, we discuss how CCC Intelligent Solutions (CCC) combined Amazon SageMaker with other AWS services to create a custom solution capable of hosting the types of complex artificial intelligence (AI) models envisioned. CCC is a […]  ( 13 min )
  • Open

    Oval orbits?
    Johannes Kepler thought that planetary orbits were ellipses. Giovanni Cassini thought they were ovals. Kepler was right, but Cassini wasn’t far off. In everyday speech, people use the words ellipse and oval interchangeably. But in mathematics these terms are distinct. There is one definition of an ellipse, and several definitions of an oval. To be […] Oval orbits? first appeared on John D. Cook.  ( 6 min )
    Cassini ovals
    An ellipse can be defined as the set of points such that the sum of the distances to two fixed points, the foci, has a constant value. A Cassini oval is the set of points such that the product of the distances to two foci has a constant value. You can write down an equation […] Cassini ovals first appeared on John D. Cook.  ( 5 min )
    Bounds on power series coefficients
    Let f be an analytic function on the unit disk with f(0) = 0 and derivative f ′(0) = 1. If f is one-to-one (injective) then this puts a strict limit on the size of the series coefficients. Let an be the nth coefficient in the power series for f centered at 0. If f is one-to-one […] Bounds on power series coefficients first appeared on John D. Cook.  ( 5 min )
  • Open

    What Is AI Computing?
    The abacus, sextant, slide rule and computer. Mathematical instruments mark the history of human progress. They’ve enabled trade and helped navigate oceans, and advanced understanding and quality of life. The latest tool propelling science and industry is AI computing. AI Computing Defined AI computing is the math-intensive process of calculating machine learning algorithms, typically using Read article >  ( 8 min )
  • Open

    MIT researchers develop an AI model that can detect future lung cancer risk
    Deep-learning model takes a personalized approach to assessing each patient’s risk of lung cancer based on CT scans.  ( 10 min )
  • Open

    Fully Autonomous Real-World Reinforcement Learning with Applications to Mobile Manipulation
    Reinforcement learning provides a conceptual framework for autonomous agents to learn from experience, analogously to how one might train a pet with treats. But practical applications of reinforcement learning are often far from natural: instead of using RL to learn through trial and error by actually attempting the desired task, typical RL applications use a separate (usually simulated) training phase. For example, AlphaGo did not learn to play Go by competing against thousands of humans, but rather by playing against itself in simulation. While this kind of simulated training is appealing for games where the rules are perfectly known, applying this to real world domains such as robotics can require a range of complex approaches, such as the use of simulated data, or instrumenting real-wo…  ( 4 min )
  • Open

    Reverse Differentiation via Predictive Coding. (arXiv:2103.04689v3 [cs.LG] UPDATED)
    Deep learning has redefined the field of artificial intelligence (AI) thanks to the rise of artificial neural networks, which are architectures inspired by their neurological counterpart in the brain. Through the years, this dualism between AI and neuroscience has brought immense benefits to both fields, allowing neural networks to be used in dozens of applications. These networks use an efficient implementation of reverse differentiation, called backpropagation (BP). This algorithm, however, is often criticized for its biological implausibility (e.g., lack of local update rules for the parameters). Therefore, biologically plausible learning methods that rely on predictive coding (PC), a framework for describing information processing in the brain, are increasingly studied. Recent works prove that these methods can approximate BP up to a certain margin on multilayer perceptrons (MLPs), and asymptotically on any other complex model, and that zero-divergence inference learning (Z-IL), a variant of PC, is able to exactly implement BP on MLPs. However, the recent literature shows also that there is no biologically plausible method yet that can exactly replicate the weight update of BP on complex models. To fill this gap, in this paper, we generalize (PC and) Z-IL by directly defining them on computational graphs, and show that it can perform exact reverse differentiation. What results is the first biologically plausible algorithm that is equivalent to BP in the way of updating parameters on any neural network, providing a bridge between the interdisciplinary research of neuroscience and deep learning.  ( 2 min )
    Digital Twin-Based Multiple Access Optimization and Monitoring via Model-Driven Bayesian Learning. (arXiv:2210.05582v2 [eess.SP] UPDATED)
    Commonly adopted in the manufacturing and aerospace sectors, digital twin (DT) platforms are increasingly seen as a promising paradigm to control and monitor software-based, "open", communication systems, which play the role of the physical twin (PT). In the general framework presented in this work, the DT builds a Bayesian model of the communication system, which is leveraged to enable core DT functionalities such as control via multi-agent reinforcement learning (MARL) and monitoring of the PT for anomaly detection. We specifically investigate the application of the proposed framework to a simple case-study system encompassing multiple sensing devices that report to a common receiver. The Bayesian model trained at the DT has the key advantage of capturing epistemic uncertainty regarding the communication system, e.g., regarding current traffic conditions, which arise from limited PT-to-DT data transfer. Experimental results validate the effectiveness of the proposed Bayesian framework as compared to standard frequentist model-based solutions.  ( 2 min )
    How Good Is NLP? A Sober Look at NLP Tasks through the Lens of Social Impact. (arXiv:2106.02359v3 [cs.CL] UPDATED)
    Recent years have seen many breakthroughs in natural language processing (NLP), transitioning it from a mostly theoretical field to one with many real-world applications. Noting the rising number of applications of other machine learning and AI techniques with pervasive societal impact, we anticipate the rising importance of developing NLP technologies for social good. Inspired by theories in moral philosophy and global priorities research, we aim to promote a guideline for social good in the context of NLP. We lay the foundations via the moral philosophy definition of social good, propose a framework to evaluate the direct and indirect real-world impact of NLP tasks, and adopt the methodology of global priorities research to identify priority causes for NLP research. Finally, we use our theoretical framework to provide some practical guidelines for future NLP research for social good. Our data and code are available at this http URL In addition, we curate a list of papers and resources on NLP for social good at https://github.com/zhijing-jin/NLP4SocialGood_Papers.  ( 2 min )
    - Modelling Difference Between Censored and Uncensored Electric Vehicle Charging Demand. (arXiv:2301.06418v2 [cs.AI] UPDATED)
    Electric vehicle charging demand models, with charging records as input, will inherently be biased toward the supply of available chargers, as the data do not include demand lost from occupied stations and competitors. This lost demand implies that the records only observe a fraction of the total demand, i.e. the observations are censored, and actual demand is likely higher than what the data reflect. Machine learning models often neglect to account for this censored demand when forecasting the charging demand, which limits models' applications for future expansions and supply management. We address this gap by modelling the charging demand with probabilistic censorship-aware graph neural networks, which learn the latent demand distribution in both the spatial and temporal dimensions. We use GPS trajectories from cars in Copenhagen, Denmark, to study how censoring occurs and much demand is lost due to occupied charging and competing services. We find that censorship varies throughout the city and over time, encouraging spatial and temporal modelling. We find that in some regions of Copenhagen, censorship occurs 61% of the time. Our results show censorship-aware models provide better prediction and uncertainty estimation in actual future demand than censorship-unaware models. Our results suggest that future models based on charging records should account for the censoring to expand the application areas of machine learning models in this supply management and infrastructure expansion.  ( 2 min )
    Scalable Deep Graph Clustering with Random-walk based Self-supervised Learning. (arXiv:2112.15530v2 [cs.LG] UPDATED)
    Web-based interactions can be frequently represented by an attributed graph, and node clustering in such graphs has received much attention lately. Multiple efforts have successfully applied Graph Convolutional Networks (GCN), though with some limits on accuracy as GCNs have been shown to suffer from over-smoothing issues. Though other methods (particularly those based on Laplacian Smoothing) have reported better accuracy, a fundamental limitation of all the work is a lack of scalability. This paper addresses this open problem by relating the Laplacian smoothing to the Generalized PageRank and applying a random-walk based algorithm as a scalable graph filter. This forms the basis for our scalable deep clustering algorithm, RwSL, where through a self-supervised mini-batch training mechanism, we simultaneously optimize a deep neural network for sample-cluster assignment distribution and an autoencoder for a clustering-oriented embedding. Using 6 real-world datasets and 6 clustering metrics, we show that RwSL achieved improved results over several recent baselines. Most notably, we show that RwSL, unlike all other deep clustering frameworks, can continue to scale beyond graphs with more than one million nodes, i.e., handle web-scale. We also demonstrate how RwSL could perform node clustering on a graph with 1.8 billion edges using only a single GPU.  ( 2 min )
    Concentration inequalities for leave-one-out cross validation. (arXiv:2211.02478v2 [math.ST] UPDATED)
    In this article we prove that estimator stability is enough to show that leave-one-out cross validation is a sound procedure, by providing concentration bounds in a general framework. In particular, we provide concentration bounds beyond Lipschitz continuity assumptions on the loss or on the estimator. In order to obtain our results, we rely on random variables with distribution satisfying the logarithmic Sobolev inequality, providing us a relatively rich class of distributions. We illustrate our method by considering several interesting examples, including linear regression, kernel density estimation, and stabilized / truncated estimators such as stabilized kernel regression.  ( 2 min )
    Adversarial AI in Insurance: Pervasiveness and Resilience. (arXiv:2301.07520v1 [cs.LG])
    The rapid and dynamic pace of Artificial Intelligence (AI) and Machine Learning (ML) is revolutionizing the insurance sector. AI offers significant, very much welcome advantages to insurance companies, and is fundamental to their customer-centricity strategy. It also poses challenges, in the project and implementation phase. Among those, we study Adversarial Attacks, which consist of the creation of modified input data to deceive an AI system and produce false outputs. We provide examples of attacks on insurance AI applications, categorize them, and argue on defence methods and precautionary systems, considering that they can involve few-shot and zero-shot multilabelling. A related topic, with growing interest, is the validation and verification of systems incorporating AI and ML components. These topics are discussed in various sections of this paper.  ( 2 min )
    Global Contrastive Batch Sampling via Optimization on Sample Permutations. (arXiv:2210.12874v3 [cs.LG] UPDATED)
    Contrastive Learning has recently achieved state-of-the-art performance in a wide range of tasks. Many contrastive learning approaches use mined hard negatives to make batches more informative during training but these approaches are inefficient as they increase epoch length proportional to the number of mined negatives and require frequent updates of nearest neighbor indices or mining from recent batches. In this work, we provide an alternative to hard negative mining, Global Contrastive Batch Sampling (GCBS), an efficient approximation to the batch assignment problem that upper bounds the gap between the global and training losses, $\mathcal{L}^{Global} - \mathcal{L}^{Train}$, in contrastive learning settings. Through experimentation we find GCBS improves state-of-the-art performance in sentence embedding and code-search tasks. Additionally, GCBS is easy to implement as it requires only a few additional lines of code, does not maintain external data structures such as nearest neighbor indices, is more computationally efficient than the most minimal hard negative mining approaches, and makes no changes to the model being trained.  ( 2 min )
    Weight Matrix Dimensionality Reduction in Deep Learning via Kronecker Multi-layer Architectures. (arXiv:2204.04273v2 [cs.LG] UPDATED)
    Deep learning using neural networks is an effective technique for generating models of complex data. However, training such models can be expensive when networks have large model capacity resulting from a large number of layers and nodes. For training in such a computationally prohibitive regime, dimensionality reduction techniques ease the computational burden, and allow implementations of more robust networks. We propose a novel type of such dimensionality reduction via a new deep learning architecture based on fast matrix multiplication of a Kronecker product decomposition; in particular our network construction can be viewed as a Kronecker product-induced sparsification of an "extended" fully connected network. Analysis and practical examples show that this architecture allows a neural network to be trained and implemented with a significant reduction in computational time and resources, while achieving a similar error level compared to a traditional feedforward neural network.  ( 2 min )
    Dirichlet-Neumann learning algorithm for solving elliptic interface problems. (arXiv:2301.07361v1 [math.NA])
    Non-overlapping domain decomposition methods are natural for solving interface problems arising from various disciplines, however, the numerical simulation requires technical analysis and is often available only with the use of high-quality grids, thereby impeding their use in more complicated situations. To remove the burden of mesh generation and to effectively tackle with the interface jump conditions, a novel mesh-free scheme, i.e., Dirichlet-Neumann learning algorithm, is proposed in this work to solve the benchmark elliptic interface problem with high-contrast coefficients as well as irregular interfaces. By resorting to the variational principle, we carry out a rigorous error analysis to evaluate the discrepancy caused by the boundary penalty treatment for each decomposed subproblem, which paves the way for realizing the Dirichlet-Neumann algorithm using neural network extension operators. The effectiveness and robustness of our proposed methods are demonstrated experimentally through a series of elliptic interface problems, achieving better performance over other alternatives especially in the presence of erroneous flux prediction at interface.  ( 2 min )
    CLIPTER: Looking at the Bigger Picture in Scene Text Recognition. (arXiv:2301.07464v1 [cs.CV])
    Understanding the scene is often essential for reading text in real-world scenarios. However, current scene text recognizers operate on cropped text images, unaware of the bigger picture. In this work, we harness the representative power of recent vision-language models, such as CLIP, to provide the crop-based recognizer with scene, image-level information. Specifically, we obtain a rich representation of the entire image and fuse it with the recognizer word-level features via cross-attention. Moreover, a gated mechanism is introduced that gradually shifts to the context-enriched representation, enabling simply fine-tuning a pretrained recognizer. We implement our model-agnostic framework, named CLIPTER - CLIP Text Recognition, on several leading text recognizers and demonstrate consistent performance gains, achieving state-of-the-art results over multiple benchmarks. Furthermore, an in-depth analysis reveals improved robustness to out-of-vocabulary words and enhanced generalization in low-data regimes.  ( 2 min )
    Prompting Large Language Model for Machine Translation: A Case Study. (arXiv:2301.07069v2 [cs.CL] UPDATED)
    Research on prompting has shown excellent performance with little or even no supervised training across many tasks. However, prompting for machine translation is still under-explored in the literature. We fill this gap by offering a systematic study on prompting strategies for translation, examining various factors for prompt template and demonstration example selection. We further explore the use of monolingual data and the feasibility of cross-lingual, cross-domain, and sentence-to-document transfer learning in prompting. Extensive experiments with GLM-130B (Zeng et al., 2022) as the testbed show that 1) the number and the quality of prompt examples matter, where using suboptimal examples degenerates translation; 2) several features of prompt examples, such as semantic similarity, show significant Spearman correlation with their prompting performance; yet, none of the correlations are strong enough; 3) using pseudo parallel prompt examples constructed from monolingual data via zero-shot prompting could improve translation; and 4) improved performance is achievable by transferring knowledge from prompt examples selected in other settings. We finally provide an analysis on the model outputs and discuss several problems that prompting still suffers from.  ( 2 min )
    Neural DAEs: Constrained neural networks. (arXiv:2211.14302v2 [cs.LG] UPDATED)
    In this article we investigate the effect of explicitly adding auxiliary trajectory information to neural networks for dynamical systems. We draw inspiration from the field of differential-algebraic equations and differential equations on manifolds and implement similar methods in residual neural networks. We discuss constraints through stabilization as well as projection methods, and show when to use which method based on experiments involving simulations of multi-body pendulums and molecular dynamics scenarios. Several of our methods are easy to implement in existing code and have limited impact on training performance while giving significant boosts in terms of inference.  ( 2 min )
    Nostradamus: Weathering Worth. (arXiv:2212.05933v2 [q-fin.ST] UPDATED)
    Nostradamus, inspired by the French astrologer and reputed seer, is a detailed study exploring relations between environmental factors and changes in the stock market. In this paper, we analyze associative correlation and causation between environmental elements (including natural disasters, climate and weather conditions) and stock prices, using historical stock market data, historical climate data, and various climate indicators such as carbon dioxide emissions. We have conducted our study based on the US financial market, global climate trends, and daily weather records to demonstrate a significant relationship between climate and stock price fluctuation. Our analysis covers both short-term and long-term rises and dips in company stock performances. Lastly, we take four natural disasters as a case study to observe the effect they have on people's emotional state and their influence on the stock market.  ( 2 min )
    Quantification of geogrid lateral restraint using transparent sand and deep learning-based image segmentation. (arXiv:2212.02939v2 [physics.geo-ph] UPDATED)
    An experimental technique is presented to quantify the lateral restraint provided by a geogrid embedded in granular soil at the particle level. Repeated load triaxial tests were done on transparent sand specimens with geosynthetic inclusions simulating geogrids. Particle outlines on laser illuminated planes through the specimens were segmented using a deep learning-based segmentation algorithm. The particle outlines were characterized in terms of Fourier shape descriptors and tracked across sequentially captured images. The accuracy of the particle displacement measurements was validated against Digital Image Correlation (DIC) measurements. In addition, the method's resolution and repeatability is presented. Based on the measured particle displacements and rotations, a state boundary line between probable and improbable particle motions was identified for each test. The size of the zone of probable motions could be used to quantify the lateral restraint provided by the inclusions. Overall, the tests results revealed that the geosynthetic inclusions restricted both particle displacements and rotations. However, the particle displacements were found to be restrained more significantly than the rotations. Finally, a unique relationship was found between the magnitude of the permanent strains of the specimens and the size of the zone of probable motions.  ( 2 min )
    Generalized Many-Body Dispersion Correction through Random-phase Approximation for Chemically Accurate Density Functional Theory. (arXiv:2210.09784v4 [physics.chem-ph] UPDATED)
    We extend our recently proposed Deep Learning-aided many-body dispersion (DNN-MBD) model to quadrupole polarizability (Q) terms using a generalized Random Phase Approximation (RPA) formalism, thus enabling the inclusion of van der Waals contributions beyond dipole. The resulting DNN-MBDQ model only relies on ab initio-derived quantities as the introduced quadrupole polarizabilities are recursively retrieved from dipole ones, in turn modelled via the Tkatchenko-Scheffler method. A transferable and efficient deep-neuronal network (DNN) provides atom in molecule volumes, while a single range-separation parameter is used to couple the model to Density Functional Theory (DFT). Since it can be computed at a negligible cost, the DNN-MBDQ approach can be coupled with DFT functionals such as PBE,PBE0 and B86bPBE (dispersionless). The DNN-MBQ-corrected functionals reach chemical accuracy while exhibiting lower errors compared to their dipole-only counterparts.  ( 2 min )
    Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models. (arXiv:2301.06267v2 [cs.CV] UPDATED)
    The ability to quickly learn a new task with minimal instruction - known as few-shot learning - is a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot samples from a single modality, but such samples may not be sufficient to characterize an entire concept class. In contrast, humans use cross-modal information to learn new concepts efficiently. In this work, we demonstrate that one can indeed build a better ${\bf visual}$ dog classifier by ${\bf read}$ing about dogs and ${\bf listen}$ing to them bark. To do so, we exploit the fact that recent multimodal foundation models such as CLIP are inherently cross-modal, mapping different modalities to the same representation space. Specifically, we propose a simple cross-modal adaptation approach that learns from few-shot examples spanning different modalities. By repurposing class names as additional one-shot training samples, we achieve SOTA results with an embarrassingly simple linear classifier for vision-language adaptation. Furthermore, we show that our approach can benefit existing methods such as prefix tuning, adapters, and classifier ensembling. Finally, to explore other modalities beyond vision and language, we construct the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal training to improve the performance of both image and audio classification.  ( 2 min )
    Spiking Neural Network Decision Feedback Equalization. (arXiv:2211.04756v3 [eess.SP] UPDATED)
    In the past years, artificial neural networks (ANNs) have become the de-facto standard to solve tasks in communications engineering that are difficult to solve with traditional methods. In parallel, the artificial intelligence community drives its research to biology-inspired, brain-like spiking neural networks (SNNs), which promise extremely energy-efficient computing. In this paper, we investigate the use of SNNs in the context of channel equalization for ultra-low complexity receivers. We propose an SNN-based equalizer with a feedback structure akin to the decision feedback equalizer (DFE). For conversion of real-world data into spike signals we introduce a novel ternary encoding and compare it with traditional log-scale encoding. We show that our approach clearly outperforms conventional linear equalizers for three different exemplary channels. We highlight that mainly the conversion of the channel output to spikes introduces a small performance penalty. The proposed SNN with a decision feedback structure enables the path to competitive energy-efficient transceivers.  ( 2 min )
    Antenna Array Calibration Via Gaussian Process Models. (arXiv:2301.06582v2 [eess.SP] UPDATED)
    Antenna array calibration is necessary to maintain the high fidelity of beam patterns across a wide range of advanced antenna systems and to ensure channel reciprocity in time division duplexing schemes. Despite the continuous development in this area, most existing solutions are optimised for specific radio architectures, require standardised over-the-air data transmission, or serve as extensions of conventional methods. The diversity of communication protocols and hardware creates a problematic case, since this diversity requires to design or update the calibration procedures for each new advanced antenna system. In this study, we formulate antenna calibration in an alternative way, namely as a task of functional approximation, and address it via Bayesian machine learning. Our contributions are three-fold. Firstly, we define a parameter space, based on near-field measurements, that captures the underlying hardware impairments corresponding to each radiating element, their positional offsets, as well as the mutual coupling effects between antenna elements. Secondly, Gaussian process regression is used to form models from a sparse set of the aforementioned near-field data. Once deployed, the learned non-parametric models effectively serve to continuously transform the beamforming weights of the system, resulting in corrected beam patterns. Lastly, we demonstrate the viability of the described methodology for both digital and analog beamforming antenna arrays of different scales and discuss its further extension to support real-time operation with dynamic hardware impairments.  ( 2 min )
    Teacher Forcing Recovers Reward Functions for Text Generation. (arXiv:2210.08708v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) has been widely used in text generation to alleviate the exposure bias issue or to utilize non-parallel datasets. The reward function plays an important role in making RL training successful. However, previous reward functions are typically task-specific and sparse, restricting the use of RL. In our work, we propose a task-agnostic approach that derives a step-wise reward function directly from a model trained with teacher forcing. We additionally propose a simple modification to stabilize the RL training on non-parallel datasets with our induced reward function. Empirical results show that our method outperforms self-training and reward regression methods on several text generation tasks, confirming the effectiveness of our reward function.  ( 2 min )
    Planning and Learning with Adaptive Lookahead. (arXiv:2201.12403v2 [cs.LG] UPDATED)
    Some of the most powerful reinforcement learning frameworks use planning for action selection. Interestingly, their planning horizon is either fixed or determined arbitrarily by the state visitation history. Here, we expand beyond the naive fixed horizon and propose a theoretically justified strategy for adaptive selection of the planning horizon as a function of the state-dependent value estimate. We propose two variants for lookahead selection and analyze the trade-off between iteration count and computational complexity per iteration. We then devise a corresponding deep Q-network algorithm with an adaptive tree search horizon. We separate the value estimation per depth to compensate for the off-policy discrepancy between depths. Lastly, we demonstrate the efficacy of our adaptive lookahead method in a maze environment and Atari.  ( 2 min )
    An Analysis of Loss Functions for Binary Classification and Regression. (arXiv:2301.07638v1 [stat.ML])
    This paper explores connections between margin-based loss functions and consistency in binary classification and regression applications. It is shown that a large class of margin-based loss functions for binary classification/regression result in estimating scores equivalent to log-likelihood scores weighted by an even function. A simple characterization for conformable (consistent) loss functions is given, which allows for straightforward comparison of different losses, including exponential loss, logistic loss, and others. The characterization is used to construct a new Huber-type loss function for the logistic model. A simple relation between the margin and standardized logistic regression residuals is derived, demonstrating that all margin-based loss can be viewed as loss functions of squared standardized logistic regression residuals. The relation provides new, straightforward interpretations for exponential and logistic loss, and aids in understanding why exponential loss is sensitive to outliers. In particular, it is shown that minimizing empirical exponential loss is equivalent to minimizing the sum of squared standardized logistic regression residuals. The relation also provides new insight into the AdaBoost algorithm.  ( 2 min )
    Operator Learning Framework for Digital Twin and Complex Engineering Systems. (arXiv:2301.06701v2 [cs.LG] UPDATED)
    With modern computational advancements and statistical analysis methods, machine learning algorithms have become a vital part of engineering modeling. Neural Operator Networks (ONets) is an emerging machine learning algorithm as a "faster surrogate" for approximating solutions to partial differential equations (PDEs) due to their ability to approximate mathematical operators versus the direct approximation of Neural Networks (NN). ONets use the Universal Approximation Theorem to map finite-dimensional inputs to infinite-dimensional space using the branch-trunk architecture, which encodes domain and feature information separately before using a dot product to combine the information. ONets are expected to occupy a vital niche for surrogate modeling in physical systems and Digital Twin (DT) development. Three test cases are evaluated using ONets for operator approximation, including a 1-dimensional ordinary differential equations (ODE), general diffusion system, and convection-diffusion (Burger) system. Solutions for ODE and diffusion systems yield accurate and reliable results (R2>0.95), while solutions for Burger systems need further refinement in the ONet algorithm.  ( 2 min )
    Joint Representation Learning for Text and 3D Point Cloud. (arXiv:2301.07584v1 [cs.CV])
    Recent advancements in vision-language pre-training (e.g. CLIP) have shown that vision models can benefit from language supervision. While many models using language modality have achieved great success on 2D vision tasks, the joint representation learning of 3D point cloud with text remains under-explored due to the difficulty of 3D-Text data pair acquisition and the irregularity of 3D data structure. In this paper, we propose a novel Text4Point framework to construct language-guided 3D point cloud models. The key idea is utilizing 2D images as a bridge to connect the point cloud and the language modalities. The proposed Text4Point follows the pre-training and fine-tuning paradigm. During the pre-training stage, we establish the correspondence of images and point clouds based on the readily available RGB-D data and use contrastive learning to align the image and point cloud representations. Together with the well-aligned image and text features achieved by CLIP, the point cloud features are implicitly aligned with the text embeddings. Further, we propose a Text Querying Module to integrate language information into 3D representation learning by querying text embeddings with point cloud features. For fine-tuning, the model learns task-specific 3D representations under informative language guidance from the label set without 2D images. Extensive experiments demonstrate that our model shows consistent improvement on various downstream tasks, such as point cloud semantic segmentation, instance segmentation, and object detection. The code will be available here: https://github.com/LeapLabTHU/Text4Point  ( 2 min )
    A Bayesian Framework for Digital Twin-Based Control, Monitoring, and Data Collection in Wireless Systems. (arXiv:2212.01351v2 [eess.SP] UPDATED)
    Commonly adopted in the manufacturing and aerospace sectors, digital twin (DT) platforms are increasingly seen as a promising paradigm to control, monitor, and analyze software-based, "open", communication systems. Notably, DT platforms provide a sandbox in which to test artificial intelligence (AI) solutions for communication systems, potentially reducing the need to collect data and test algorithms in the field, i.e., on the physical twin (PT). A key challenge in the deployment of DT systems is to ensure that virtual control optimization, monitoring, and analysis at the DT are safe and reliable, avoiding incorrect decisions caused by "model exploitation". To address this challenge, this paper presents a general Bayesian framework with the aim of quantifying and accounting for model uncertainty at the DT that is caused by limitations in the amount and quality of data available at the DT from the PT. In the proposed framework, the DT builds a Bayesian model of the communication system, which is leveraged to enable core DT functionalities such as control via multi-agent reinforcement learning (MARL), monitoring of the PT for anomaly detection, prediction, data-collection optimization, and counterfactual analysis. To exemplify the application of the proposed framework, we specifically investigate a case-study system encompassing multiple sensing devices that report to a common receiver. Experimental results validate the effectiveness of the proposed Bayesian framework as compared to standard frequentist model-based solutions.  ( 2 min )
    Learning Task-Oriented Communication for Edge Inference: An Information Bottleneck Approach. (arXiv:2102.04170v3 [eess.SP] UPDATED)
    This paper investigates task-oriented communication for edge inference, where a low-end edge device transmits the extracted feature vector of a local data sample to a powerful edge server for processing. It is critical to encode the data into an informative and compact representation for low-latency inference given the limited bandwidth. We propose a learning-based communication scheme that jointly optimizes feature extraction, source coding, and channel coding in a task-oriented manner, i.e., targeting the downstream inference task rather than data reconstruction. Specifically, we leverage an information bottleneck (IB) framework to formalize a rate-distortion tradeoff between the informativeness of the encoded feature and the inference performance. As the IB optimization is computationally prohibitive for the high-dimensional data, we adopt a variational approximation, namely the variational information bottleneck (VIB), to build a tractable upper bound. To reduce the communication overhead, we leverage a sparsity-inducing distribution as the variational prior for the VIB framework to sparsify the encoded feature vector. Furthermore, considering dynamic channel conditions in practical communication systems, we propose a variable-length feature encoding scheme based on dynamic neural networks to adaptively adjust the activated dimensions of the encoded feature to different channel conditions. Extensive experiments evidence that the proposed task-oriented communication system achieves a better rate-distortion tradeoff than baseline methods and significantly reduces the feature transmission latency in dynamic channel conditions.  ( 2 min )
    Auxiliary Cross-Modal Representation Learning with Triplet Loss Functions for Online Handwriting Recognition. (arXiv:2202.07901v2 [cs.LG] UPDATED)
    Cross-modal representation learning learns a shared embedding between two or more modalities to improve performance in a given task compared to using only one of the modalities. Cross-modal representation learning from different data types -- such as images and time-series data (e.g., audio or text data) -- requires a deep metric learning loss that minimizes the distance between the modality embeddings. In this paper, we propose to use the triplet loss, which uses positive and negative identities to create sample pairs with different labels, for cross-modal representation learning between image and time-series modalities (CMR-IS). By adapting the triplet loss for cross-modal representation learning, higher accuracy in the main (time-series classification) task can be achieved by exploiting additional information of the auxiliary (image classification) task. Our experiments on synthetic data and handwriting recognition data from sensor-enhanced pens show improved classification accuracy, faster convergence, and better generalizability.  ( 2 min )
    Theseus: A Library for Differentiable Nonlinear Optimization. (arXiv:2207.09442v3 [cs.RO] UPDATED)
    We present Theseus, an efficient application-agnostic open source library for differentiable nonlinear least squares (DNLS) optimization built on PyTorch, providing a common framework for end-to-end structured learning in robotics and vision. Existing DNLS implementations are application specific and do not always incorporate many ingredients important for efficiency. Theseus is application-agnostic, as we illustrate with several example applications that are built using the same underlying differentiable components, such as second-order optimizers, standard costs functions, and Lie groups. For efficiency, Theseus incorporates support for sparse solvers, automatic vectorization, batching, GPU acceleration, and gradient computation with implicit differentiation and direct loss minimization. We do extensive performance evaluation in a set of applications, demonstrating significant efficiency gains and better scalability when these features are incorporated. Project page: https://sites.google.com/view/theseus-ai  ( 2 min )
    Non-parametric identifiability and sensitivity analysis of synthetic control models. (arXiv:2301.07656v1 [stat.ME])
    Quantifying cause and effect relationships is an important problem in many domains. The gold standard solution is to conduct a randomised controlled trial. However, in many situations such trials cannot be performed. In the absence of such trials, many methods have been devised to quantify the causal impact of an intervention from observational data given certain assumptions. One widely used method are synthetic control models. While identifiability of the causal estimand in such models has been obtained from a range of assumptions, it is widely and implicitly assumed that the underlying assumptions are satisfied for all time periods both pre- and post-intervention. This is a strong assumption, as synthetic control models can only be learned in pre-intervention period. In this paper we address this challenge, and prove identifiability can be obtained without the need for this assumption, by showing it follows from the principle of invariant causal mechanisms. Moreover, for the first time, we formulate and study synthetic control models in Pearl's structural causal model framework. Importantly, we provide a general framework for sensitivity analysis of synthetic control causal inference to violations of the assumptions underlying non-parametric identifiability. We end by providing an empirical demonstration of our sensitivity analysis framework on simulated and real data in the widely-used linear synthetic control framework.  ( 2 min )
    Comprehensive Literature Survey on Deep Learning used in Image Memorability Prediction and Modification. (arXiv:2301.06080v2 [cs.CV] UPDATED)
    As humans, we can remember certain visuals in great detail, and sometimes even after viewing them once. What is even more interesting is that humans tend to remember and forget the same things, suggesting that there might be some general internal characteristics of an image to encode and discard similar types of information. Research suggests that some pictures tend to be memorized more than others. The ability of an image to be remembered by different viewers is one of its intrinsic properties. In visualization and photography, creating memorable images is a difficult task. Hence, to solve the problem, various techniques predict visual memorability and manipulate images' memorability. We present a comprehensive literature survey to assess the deep learning techniques used to predict and modify memorability. In particular, we analyze the use of Convolutional Neural Networks, Recurrent Neural Networks, and Generative Adversarial Networks for image memorability prediction and modification.  ( 2 min )
    Factors other than climate change are currently more important in predicting how well fruit farms are doing financially. (arXiv:2301.07685v1 [cs.LG])
    Machine learning and statistical modeling methods were used to analyze the impact of climate change on financial wellbeing of fruit farmers in Tunisia and Chile. The analysis was based on face to face interviews with 801 farmers. Three research questions were investigated. First, whether climate change impacts had an effect on how well the farm was doing financially. Second, if climate change was not influential, what factors were important for predicting financial wellbeing of the farm. And third, ascertain whether observed effects on the financial wellbeing of the farm were a result of interactions between predictor variables. This is the first report directly comparing climate change with other factors potentially impacting financial wellbeing of farms. Certain climate change factors, namely increases in temperature and reductions in precipitation, can regionally impact self-perceived financial wellbeing of fruit farmers. Specifically, increases in temperature and reduction in precipitation can have a measurable negative impact on the financial wellbeing of farms in Chile. This effect is less pronounced in Tunisia. Climate impact differences were observed within Chile but not in Tunisia. However, climate change is only of minor importance for predicting farm financial wellbeing, especially for farms already doing financially well. Factors that are more important, mainly in Tunisia, included trust in information sources and prior farm ownership. Other important factors include farm size, water management systems used and diversity of fruit crops grown. Moreover, some of the important factors identified differed between farms doing and not doing well financially. Interactions between factors may improve or worsen farm financial wellbeing.  ( 2 min )
    An Overview of Human Activity Recognition Using Wearable Sensors: Healthcare and Artificial Intelligence. (arXiv:2103.15990v7 [cs.HC] UPDATED)
    With the rapid development of the internet of things (IoT) and artificial intelligence (AI) technologies, human activity recognition (HAR) has been applied in a variety of domains such as security and surveillance, human-robot interaction, and entertainment. Even though a number of surveys and review papers have been published, there is a lack of HAR overview papers focusing on healthcare applications that use wearable sensors. Therefore, we fill in the gap by presenting this overview paper. In particular, we present our projects to illustrate the system design of HAR applications for healthcare. Our projects include early mobility identification of human activities for intensive care unit (ICU) patients and gait analysis of Duchenne muscular dystrophy (DMD) patients. We cover essential components of designing HAR systems including sensor factors (e.g., type, number, and placement location), AI model selection (e.g., classical machine learning models versus deep learning models), and feature engineering. In addition, we highlight the challenges of such healthcare-oriented HAR systems and propose several research opportunities for both the medical and the computer science community.  ( 2 min )
    Improving Federated Learning Personalization via Model Agnostic Meta Learning. (arXiv:1909.12488v2 [cs.LG] UPDATED)
    Federated Learning (FL) refers to learning a high quality global model based on decentralized data storage, without ever copying the raw data. A natural scenario arises with data created on mobile phones by the activity of their users. Given the typical data heterogeneity in such situations, it is natural to ask how can the global model be personalized for every such device, individually. In this work, we point out that the setting of Model Agnostic Meta Learning (MAML), where one optimizes for a fast, gradient-based, few-shot adaptation to a heterogeneous distribution of tasks, has a number of similarities with the objective of personalization for FL. We present FL as a natural source of practical applications for MAML algorithms, and make the following observations. 1) The popular FL algorithm, Federated Averaging, can be interpreted as a meta learning algorithm. 2) Careful fine-tuning can yield a global model with higher accuracy, which is at the same time easier to personalize. However, solely optimizing for the global model accuracy yields a weaker personalization result. 3) A model trained using a standard datacenter optimization method is much harder to personalize, compared to one trained using Federated Averaging, supporting the first claim. These results raise new questions for FL, MAML, and broader ML research.  ( 2 min )
    Creating awareness about security and safety on highways to mitigate wildlife-vehicle collisions by detecting and recognizing wildlife fences using deep learning and drone technology. (arXiv:2301.07174v1 [cs.CV])
    In South Africa, it is a common practice for people to leave their vehicles beside the road when traveling long distances for a short comfort break. This practice might increase human encounters with wildlife, threatening their security and safety. Here we intend to create awareness about wildlife fencing, using drone technology and computer vision algorithms to recognize and detect wildlife fences and associated features. We collected data at Amakhala and Lalibela private game reserves in the Eastern Cape, South Africa. We used wildlife electric fence data containing single and double fences for the classification task. Additionally, we used aerial and still annotated images extracted from the drone and still cameras for the segmentation and detection tasks. The model training results from the drone camera outperformed those from the still camera. Generally, poor model performance is attributed to (1) over-decompression of images and (2) the ability of drone cameras to capture more details on images for the machine learning model to learn as compared to still cameras that capture only the front view of the wildlife fence. We argue that our model can be deployed on client-edge devices to inform people about the presence and significance of wildlife fencing, which minimizes human encounters with wildlife, thereby mitigating wildlife-vehicle collisions.  ( 2 min )
    Enhancing Self-Training Methods. (arXiv:2301.07294v1 [cs.LG])
    Semi-supervised learning approaches train on small sets of labeled data along with large sets of unlabeled data. Self-training is a semi-supervised teacher-student approach that often suffers from the problem of "confirmation bias" that occurs when the student model repeatedly overfits to incorrect pseudo-labels given by the teacher model for the unlabeled data. This bias impedes improvements in pseudo-label accuracy across self-training iterations, leading to unwanted saturation in model performance after just a few iterations. In this work, we describe multiple enhancements to improve the self-training pipeline to mitigate the effect of confirmation bias. We evaluate our enhancements over multiple datasets showing performance gains over existing self-training design choices. Finally, we also study the extendability of our enhanced approach to Open Set unlabeled data (containing classes not seen in labeled data).  ( 2 min )
    Safety Verification of Neural Network Control Systems Using Guaranteed Neural Network Model Reduction. (arXiv:2301.07531v1 [cs.LG])
    This paper aims to enhance the computational efficiency of safety verification of neural network control systems by developing a guaranteed neural network model reduction method. First, a concept of model reduction precision is proposed to describe the guaranteed distance between the outputs of a neural network and its reduced-size version. A reachability-based algorithm is proposed to accurately compute the model reduction precision. Then, by substituting a reduced-size neural network controller into the closed-loop system, an algorithm to compute the reachable set of the original system is developed, which is able to support much more computationally efficient safety verification processes. Finally, the developed methods are applied to a case study of the Adaptive Cruise Control system with a neural network controller, which is shown to significantly reduce the computational time of safety verification and thus validate the effectiveness of the method.  ( 2 min )
    Improved Differential Privacy for SGD via Optimal Private Linear Operators on Adaptive Streams. (arXiv:2202.08312v3 [cs.LG] UPDATED)
    Motivated by recent applications requiring differential privacy over adaptive streams, we investigate the question of optimal instantiations of the matrix mechanism in this setting. We prove fundamental theoretical results on the applicability of matrix factorizations to adaptive streams, and provide a parameter-free fixed-point algorithm for computing optimal factorizations. We instantiate this framework with respect to concrete matrices which arise naturally in machine learning, and train user-level differentially private models with the resulting optimal mechanisms, yielding significant improvements in a notable problem in federated learning with user-level differential privacy.  ( 2 min )
    Learning image representations for anomaly detection: application to discovery of histological alterations in drug development. (arXiv:2210.07675v3 [cs.CV] UPDATED)
    We present a system for anomaly detection in histopathological images. In histology, normal samples are usually abundant, whereas anomalous (pathological) cases are scarce or not available. Under such settings, one-class classifiers trained on healthy data can detect out-of-distribution anomalous samples. Such approaches combined with pre-trained Convolutional Neural Network (CNN) representations of images were previously employed for anomaly detection (AD). However, pre-trained off-the-shelf CNN representations may not be sensitive to abnormal conditions in tissues, while natural variations of healthy tissue may result in distant representations. To adapt representations to relevant details in healthy tissue we propose training a CNN on an auxiliary task that discriminates healthy tissue of different species, organs, and staining reagents. Almost no additional labeling workload is required, since healthy samples come automatically with aforementioned labels. During training we enforce compact image representations with a center-loss term, which further improves representations for AD. The proposed system outperforms established AD methods on a published dataset of liver anomalies. Moreover, it provided comparable results to conventional methods specifically tailored for quantification of liver anomalies. We show that our approach can be used for toxicity assessment of candidate drugs at early development stages and thereby may reduce expensive late-stage drug attrition.  ( 2 min )
    Universal Neural-Cracking-Machines: Self-Configurable Password Models from Auxiliary Data. (arXiv:2301.07628v1 [cs.CR])
    We develop the first universal password model -- a password model that, once pre-trained, can automatically adapt to any password distribution. To achieve this result, the model does not need to access any plaintext passwords from the target set. Instead, it exploits users' auxiliary information, such as email addresses, as a proxy signal to predict the underlying target password distribution. The model uses deep learning to capture the correlation between the auxiliary data of a group of users (e.g., users of a web application) and their passwords. It then exploits those patterns to create a tailored password model for the target community at inference time. No further training steps, targeted data collection, or prior knowledge of the community's password distribution is required. Besides defining a new state-of-the-art for password strength estimation, our model enables any end-user (e.g., system administrators) to autonomously generate tailored password models for their systems without the often unworkable requirement of collecting suitable training data and fitting the underlying password model. Ultimately, our framework enables the democratization of well-calibrated password models to the community, addressing a major challenge in the deployment of password security solutions on a large scale.  ( 2 min )
    Feature Alignment as a Generative Process. (arXiv:2106.12562v2 [cs.LG] UPDATED)
    Reversibility in artificial neural networks allows us to retrieve the input given an output. We present feature alignment, a method for approximating reversibility in arbitrary neural networks. We train a network by minimizing the distance between the output of a data point and the random output with respect to a random input. We applied the technique to the MNIST, CIFAR-10, CelebA and STL-10 image datasets. We demonstrate that this method can roughly recover images from just their latent representation without the need of a decoder. By utilizing the formulation of variational autoencoders, we demonstrate that it is possible to produce new images that are statistically comparable to the training data. Furthermore, we demonstrate that the quality of the images can be improved by coupling a generator and a discriminator together. In addition, we show how this method, with a few minor modifications, can be used to train networks locally, which has the potential to save computational memory resources.  ( 2 min )
    Cross-Domain Evaluation of a Deep Learning-Based Type Inference System. (arXiv:2208.09189v2 [cs.SE] UPDATED)
    Optional type annotations allow for enriching dynamic programming languages with static typing features like better Integrated Development Environment (IDE) support, more precise program analysis, and early detection and prevention of type-related runtime errors. Machine learning-based type inference promises interesting results for automating this task. However, the practical usage of such systems depends on their ability to generalize across different domains, as they are often applied outside their training domain. In this work, we investigate Type4Py as a representative of state-of-the-art deep learning-based type inference systems, by conducting extensive cross-domain experiments. Thereby, we address the following problems: class imbalances, out-of-vocabulary words, dataset shifts, and unknown classes. To perform such experiments, we use the datasets ManyTypes4Py and CrossDomainTypes4Py. The latter we introduce in this paper. Our dataset enables the evaluation of type inference systems in different domains of software projects and has over 1,000,000 type annotations mined on the platforms GitHub and Libraries. It consists of data from the two domains web development and scientific calculation. Through our experiments, we detect that the shifts in the dataset and the long-tailed distribution with many rare and unknown data types decrease the performance of the deep learning-based type inference system drastically. In this context, we test unsupervised domain adaptation methods and fine-tuning to overcome these issues. Moreover, we investigate the impact of out-of-vocabulary words.  ( 2 min )
    InstructPix2Pix: Learning to Follow Image Editing Instructions. (arXiv:2211.09800v2 [cs.CV] UPDATED)
    We propose a method for editing images from human instructions: given an input image and a written instruction that tells the model what to do, our model follows these instructions to edit the image. To obtain training data for this problem, we combine the knowledge of two large pretrained models -- a language model (GPT-3) and a text-to-image model (Stable Diffusion) -- to generate a large dataset of image editing examples. Our conditional diffusion model, InstructPix2Pix, is trained on our generated data, and generalizes to real images and user-written instructions at inference time. Since it performs edits in the forward pass and does not require per example fine-tuning or inversion, our model edits images quickly, in a matter of seconds. We show compelling editing results for a diverse collection of input images and written instructions.  ( 2 min )
    Sample Complexity of Adversarially Robust Linear Classification on Separated Data. (arXiv:2012.10794v3 [cs.LG] UPDATED)
    We consider the sample complexity of learning with adversarial robustness. Most prior theoretical results for this problem have considered a setting where different classes in the data are close together or overlapping. Motivated by some real applications, we consider, in contrast, the well-separated case where there exists a classifier with perfect accuracy and robustness, and show that the sample complexity narrates an entirely different story. Specifically, for linear classifiers, we show a large class of well-separated distributions where the expected robust loss of any algorithm is at least $\Omega(\frac{d}{n})$, whereas the max margin algorithm has expected standard loss $O(\frac{1}{n})$. This shows a gap in the standard and robust losses that cannot be obtained via prior techniques. Additionally, we present an algorithm that, given an instance where the robustness radius is much smaller than the gap between the classes, gives a solution with expected robust loss is $O(\frac{1}{n})$. This shows that for very well-separated data, convergence rates of $O(\frac{1}{n})$ are achievable, which is not the case otherwise. Our results apply to robustness measured in any $\ell_p$ norm with $p > 1$ (including $p = \infty$).  ( 2 min )
    An investigation of the reconstruction capacity of stacked convolutional autoencoders for log-mel-spectrograms. (arXiv:2301.07665v1 [cs.SD])
    In audio processing applications, the generation of expressive sounds based on high-level representations demonstrates a high demand. These representations can be used to manipulate the timbre and influence the synthesis of creative instrumental notes. Modern algorithms, such as neural networks, have inspired the development of expressive synthesizers based on musical instrument timbre compression. Unsupervised deep learning methods can achieve audio compression by training the network to learn a mapping from waveforms or spectrograms to low-dimensional representations. This study investigates the use of stacked convolutional autoencoders for the compression of time-frequency audio representations for a variety of instruments for a single pitch. Further exploration of hyper-parameters and regularization techniques is demonstrated to enhance the performance of the initial design. In an unsupervised manner, the network is able to reconstruct a monophonic and harmonic sound based on latent representations. In addition, we introduce an evaluation metric to measure the similarity between the original and reconstructed samples. Evaluating a deep generative model for the synthesis of sound is a challenging task. Our approach is based on the accuracy of the generated frequencies as it presents a significant metric for the perception of harmonic sounds. This work is expected to accelerate future experiments on audio compression using neural autoencoders.  ( 2 min )
    PENDANTSS: PEnalized Norm-ratios Disentangling Additive Noise, Trend and Sparse Spikes. (arXiv:2301.01514v1 [eess.SP] CROSS LISTED)
    Denoising, detrending, deconvolution: usual restoration tasks, traditionally decoupled. Coupled formulations entail complex ill-posed inverse problems. We propose PENDANTSS for joint trend removal and blind deconvolution of sparse peak-like signals. It blends a parsimonious prior with the hypothesis that smooth trend and noise can somewhat be separated by low-pass filtering. We combine the generalized quasi-norm ratio SOOT/SPOQ sparse penalties $\ell_p/\ell_q$ with the BEADS ternary assisted source separation algorithm. This results in a both convergent and efficient tool, with a novel Trust-Region block alternating variable metric forward-backward approach. It outperforms comparable methods, when applied to typically peaked analytical chemistry signals. Reproducible code is provided.  ( 2 min )
    Concrete Score Matching: Generalized Score Matching for Discrete Data. (arXiv:2211.00802v2 [cs.LG] UPDATED)
    Representing probability distributions by the gradient of their density functions has proven effective in modeling a wide range of continuous data modalities. However, this representation is not applicable in discrete domains where the gradient is undefined. To this end, we propose an analogous score function called the "Concrete score", a generalization of the (Stein) score for discrete settings. Given a predefined neighborhood structure, the Concrete score of any input is defined by the rate of change of the probabilities with respect to local directional changes of the input. This formulation allows us to recover the (Stein) score in continuous domains when measuring such changes by the Euclidean distance, while using the Manhattan distance leads to our novel score function in discrete domains. Finally, we introduce a new framework to learn such scores from samples called Concrete Score Matching (CSM), and propose an efficient training objective to scale our approach to high dimensions. Empirically, we demonstrate the efficacy of CSM on density estimation tasks on a mixture of synthetic, tabular, and high-dimensional image datasets, and demonstrate that it performs favorably relative to existing baselines for modeling discrete data.  ( 2 min )
    Multimodal learning with graphs. (arXiv:2209.03299v5 [cs.LG] UPDATED)
    Artificial intelligence for graphs has achieved remarkable success in modeling complex systems, ranging from dynamic networks in biology to interacting particle systems in physics. However, the increasingly heterogeneous graph datasets call for multimodal methods that can combine different inductive biases: the set of assumptions that algorithms use to make predictions for inputs they have not encountered during training. Learning on multimodal datasets presents fundamental challenges because the inductive biases can vary by data modality and graphs might not be explicitly given in the input. To address these challenges, multimodal graph AI methods combine different modalities while leveraging cross-modal dependencies using graphs. Diverse datasets are combined using graphs and fed into sophisticated multimodal architectures, specified as image-intensive, knowledge-grounded and language-intensive models. Using this categorization, we introduce a blueprint for multimodal graph learning, use it to study existing methods and provide guidelines to design new models.  ( 2 min )
    Optimization-based Block Coordinate Gradient Coding for Mitigating Partial Stragglers in Distributed Learning. (arXiv:2206.02450v2 [cs.IT] UPDATED)
    Gradient coding schemes effectively mitigate full stragglers in distributed learning by introducing identical redundancy in coded local partial derivatives corresponding to all model parameters. However, they are no longer effective for partial stragglers as they cannot utilize incomplete computation results from partial stragglers. This paper aims to design a new gradient coding scheme for mitigating partial stragglers in distributed learning. Specifically, we consider a distributed system consisting of one master and N workers, characterized by a general partial straggler model and focuses on solving a general large-scale machine learning problem with L model parameters using gradient coding. First, we propose a coordinate gradient coding scheme with L coding parameters representing L possibly different diversities for the L coordinates, which generates most gradient coding schemes. Then, we consider the minimization of the expected overall runtime and the maximization of the completion probability with respect to the L coding parameters for coordinates, which are challenging discrete optimization problems. To reduce computational complexity, we first transform each to an equivalent but much simpler discrete problem with N\llL variables representing the partition of the L coordinates into N blocks, each with identical redundancy. This indicates an equivalent but more easily implemented block coordinate gradient coding scheme with N coding parameters for blocks. Then, we adopt continuous relaxation to further reduce computational complexity. For the resulting minimization of expected overall runtime, we develop an iterative algorithm of computational complexity O(N^2) to obtain an optimal solution and derive two closed-form approximate solutions both with computational complexity O(N). For the resultant maximization of the completion probability, we develop an iterative algorithm of...  ( 3 min )
    Hybrid quantum-classical convolutional neural networks to improve molecular protein binding affinity predictions. (arXiv:2301.06331v2 [quant-ph] UPDATED)
    One of the main challenges in drug discovery is to find molecules that bind specifically and strongly to their target protein while having minimal binding to other proteins. By predicting binding affinity, it is possible to identify the most promising candidates from a large pool of potential compounds, reducing the number of compounds that need to be tested experimentally. Recently, deep learning methods have shown superior performance than traditional computational methods for making accurate predictions on large datasets. However, the complexity and time-consuming nature of these methods have limited their usage and development. Quantum machine learning is an emerging technology that has the potential to improve many classical machine learning algorithms. In this work we present a hybrid quantum-classical convolutional neural network, which is able to reduce by 20% the complexity of the classical network while maintaining optimal performance in the predictions. Additionally, it results in a significant time savings of up to 40% in the training process, which means a meaningful speed up of the drug discovery process.  ( 2 min )
    Consistent Non-Parametric Methods for Maximizing Robustness. (arXiv:2102.09086v3 [cs.LG] UPDATED)
    Learning classifiers that are robust to adversarial examples has received a great deal of recent attention. A major drawback of the standard robust learning framework is there is an artificial robustness radius $r$ that applies to all inputs. This ignores the fact that data may be highly heterogeneous, in which case it is plausible that robustness regions should be larger in some regions of data, and smaller in others. In this paper, we address this limitation by proposing a new limit classifier, called the neighborhood optimal classifier, that extends the Bayes optimal classifier outside its support by using the label of the closest in-support point. We then argue that this classifier maximizes the size of its robustness regions subject to the constraint of having accuracy equal to the Bayes optimal. We then present sufficient conditions under which general non-parametric methods that can be represented as weight functions converge towards this limit, and show that both nearest neighbors and kernel classifiers satisfy them under certain conditions.  ( 2 min )
    ReFresh: Reducing Memory Access from Exploiting Stable Historical Embeddings for Graph Neural Network Training. (arXiv:2301.07482v1 [cs.LG])
    A key performance bottleneck when training graph neural network (GNN) models on large, real-world graphs is loading node features onto a GPU. Due to limited GPU memory, expensive data movement is necessary to facilitate the storage of these features on alternative devices with slower access (e.g. CPU memory). Moreover, the irregularity of graph structures contributes to poor data locality which further exacerbates the problem. Consequently, existing frameworks capable of efficiently training large GNN models usually incur a significant accuracy degradation because of the inevitable shortcuts involved. To address these limitations, we instead propose ReFresh, a general-purpose GNN mini-batch training framework that leverages a historical cache for storing and reusing GNN node embeddings instead of re-computing them through fetching raw features at every iteration. Critical to its success, the corresponding cache policy is designed, using a combination of gradient-based and staleness criteria, to selectively screen those embeddings which are relatively stable and can be cached, from those that need to be re-computed to reduce estimation errors and subsequent downstream accuracy loss. When paired with complementary system enhancements to support this selective historical cache, ReFresh is able to accelerate the training speed on large graph datasets such as ogbn-papers100M and MAG240M by 4.6x up to 23.6x and reduce the memory access by 64.5% (85.7% higher than a raw feature cache), with less than 1% influence on test accuracy.  ( 2 min )
    Strong inductive biases provably prevent harmless interpolation. (arXiv:2301.07605v1 [stat.ML])
    Classical wisdom suggests that estimators should avoid fitting noise to achieve good generalization. In contrast, modern overparameterized models can yield small test error despite interpolating noise -- a phenomenon often called "benign overfitting" or "harmless interpolation". This paper argues that the degree to which interpolation is harmless hinges upon the strength of an estimator's inductive bias, i.e., how heavily the estimator favors solutions with a certain structure: while strong inductive biases prevent harmless interpolation, weak inductive biases can even require fitting noise to generalize well. Our main theoretical result establishes tight non-asymptotic bounds for high-dimensional kernel regression that reflect this phenomenon for convolutional kernels, where the filter size regulates the strength of the inductive bias. We further provide empirical evidence of the same behavior for deep neural networks with varying filter sizes and rotational invariance.  ( 2 min )
    What relations are reliably embeddable in Euclidean space?. (arXiv:1903.05347v3 [cs.LG] UPDATED)
    We consider the problem of embedding a relation, represented as a directed graph, into Euclidean space. For three types of embeddings motivated by the recent literature on knowledge graphs, we obtain characterizations of which relations they are able to capture, as well as bounds on the minimal dimensionality and precision needed.  ( 2 min )
    Neural Network Quantization for Efficient Inference: A Survey. (arXiv:2112.06126v2 [cs.LG] UPDATED)
    As neural networks have become more powerful, there has been a rising desire to deploy them in the real world; however, the power and accuracy of neural networks is largely due to their depth and complexity, making them difficult to deploy, especially in resource-constrained devices. Neural network quantization has recently arisen to meet this demand of reducing the size and complexity of neural networks by reducing the precision of a network. With smaller and simpler networks, it becomes possible to run neural networks within the constraints of their target hardware. This paper surveys the many neural network quantization techniques that have been developed in the last decade. Based on this survey and comparison of neural network quantization techniques, we propose future directions of research in the area.  ( 2 min )
    Prediction of Red Wine Quality Using One-dimensional Convolutional Neural Networks. (arXiv:2208.14008v2 [cs.LG] UPDATED)
    As an alcoholic beverage, wine has remained prevalent for thousands of years, and the quality assessment of wines has been significant in wine production and trade. Scholars have proposed various deep learning and machine learning algorithms for wine quality prediction, such as Support vector machine (SVM), Random Forest (RF), K-nearest neighbors (KNN), Deep neural network (DNN), and Logistic regression (LR). However, these methods ignore the inner relationship between the physical and chemical properties of the wine, for example, the correlations between pH values, fixed acidity, citric acid, and so on. To fill the gap, this paper conducts the Pearson correlation analysis, PCA analysis, and Shapiro-Wilk test on those properties and incorporates 1D-CNN architecture to capture the correlations among neighboring features. In addition, it implemented dropout and batch normalization techniques to improve the robustness of the proposed model. Massive experiments have shown that our method can outperform baseline approaches in wine quality prediction. Moreover, ablation experiments also demonstrate the effectiveness of incorporating the 1-D CNN module, Dropout, and normalization techniques.  ( 2 min )
    Performance-Preserving Event Log Sampling for Predictive Monitoring. (arXiv:2301.07624v1 [cs.LG])
    Predictive process monitoring is a subfield of process mining that aims to estimate case or event features for running process instances. Such predictions are of significant interest to the process stakeholders. However, most of the state-of-the-art methods for predictive monitoring require the training of complex machine learning models, which is often inefficient. Moreover, most of these methods require a hyper-parameter optimization that requires several repetitions of the training process which is not feasible in many real-life applications. In this paper, we propose an instance selection procedure that allows sampling training process instances for prediction models. We show that our instance selection procedure allows for a significant increase of training speed for next activity and remaining time prediction methods while maintaining reliable levels of prediction accuracy.  ( 2 min )
    Concentration of polynomial random matrices via Efron-Stein inequalities. (arXiv:2209.02655v2 [cs.CC] UPDATED)
    Analyzing concentration of large random matrices is a common task in a wide variety of fields. Given independent random variables, many tools are available to analyze random matrices whose entries are linear in the variables, e.g. the matrix-Bernstein inequality. However, in many applications, we need to analyze random matrices whose entries are polynomials in the variables. These arise naturally in the analysis of spectral algorithms, e.g., Hopkins et al. [STOC 2016], Moitra-Wein [STOC 2019]; and in lower bounds for semidefinite programs based on the Sum of Squares hierarchy, e.g. Barak et al. [FOCS 2016], Jones et al. [FOCS 2021]. In this work, we present a general framework to obtain such bounds, based on the matrix Efron-Stein inequalities developed by Paulin-Mackey-Tropp [Annals of Probability 2016]. The Efron-Stein inequality bounds the norm of a random matrix by the norm of another simpler (but still random) matrix, which we view as arising by "differentiating" the starting matrix. By recursively differentiating, our framework reduces the main task to analyzing far simpler matrices. For Rademacher variables, these simpler matrices are in fact deterministic and hence, analyzing them is far easier. For general non-Rademacher variables, the task reduces to scalar concentration, which is much easier. Moreover, in the setting of polynomial matrices, our results generalize the work of Paulin-Mackey-Tropp. Using our basic framework, we recover known bounds in the literature for simple "tensor networks" and "dense graph matrices". Using our general framework, we derive bounds for "sparse graph matrices", which were obtained only recently by Jones et al. [FOCS 2021] using a nontrivial application of the trace power method, and was a core component in their work. We expect our framework to be helpful for other applications involving concentration phenomena for nonlinear random matrices.  ( 3 min )
    Multimodal Side-Tuning for Document Classification. (arXiv:2301.07502v1 [cs.LG])
    In this paper, we propose to exploit the side-tuning framework for multimodal document classification. Side-tuning is a methodology for network adaptation recently introduced to solve some of the problems related to previous approaches. Thanks to this technique it is actually possible to overcome model rigidity and catastrophic forgetting of transfer learning by fine-tuning. The proposed solution uses off-the-shelf deep learning architectures leveraging the side-tuning framework to combine a base model with a tandem of two side networks. We show that side-tuning can be successfully employed also when different data sources are considered, e.g. text and images in document classification. The experimental results show that this approach pushes further the limit for document classification accuracy with respect to the state of the art.  ( 2 min )
    Non-IID Quantum Federated Learning with One-shot Communication Complexity. (arXiv:2209.00768v2 [quant-ph] UPDATED)
    Federated learning refers to the task of machine learning based on decentralized data from multiple clients with secured data privacy. Recent studies show that quantum algorithms can be exploited to boost its performance. However, when the clients' data are not independent and identically distributed (IID), the performance of conventional federated algorithms is known to deteriorate. In this work, we explore the non-IID issue in quantum federated learning with both theoretical and numerical analysis. We further prove that a global quantum channel can be exactly decomposed into local channels trained by each client with the help of local density estimators. This observation leads to a general framework for quantum federated learning on non-IID data with one-shot communication complexity. Numerical simulations show that the proposed algorithm outperforms the conventional ones significantly under non-IID settings.  ( 2 min )
    Targeted Image Reconstruction by Sampling Pre-trained Diffusion Model. (arXiv:2301.07557v1 [cs.LG])
    A trained neural network model contains information on the training data. Given such a model, malicious parties can leverage the "knowledge" in this model and design ways to print out any usable information (known as model inversion attack). Therefore, it is valuable to explore the ways to conduct a such attack and demonstrate its severity. In this work, we proposed ways to generate a data point of the target class without prior knowledge of the exact target distribution by using a pre-trained diffusion model.  ( 2 min )
    A Novel, Scale-Invariant, Differentiable, Efficient, Scalable Regularizer. (arXiv:2301.07285v1 [cs.LG])
    $L_{p}$-norm regularization schemes such as $L_{0}$, $L_{1}$, and $L_{2}$-norm regularization and $L_{p}$-norm-based regularization techniques such as weight decay and group LASSO compute a quantity which de pends on model weights considered in isolation from one another. This paper describes a novel regularizer which is not based on an $L_{p}$-norm. In contrast with $L_{p}$-norm-based regularization, this regularizer is concerned with the spatial arrangement of weights within a weight matrix. This regularizer is an additive term for the loss function and is differentiable, simple and fast to compute, scale-invariant, requires a trivial amount of additional memory, and can easily be parallelized. Empirically this method yields approximately a one order-of-magnitude improvement in the number of nonzero model parameters at a given level of accuracy.  ( 2 min )
    Reliable amortized variational inference with physics-based latent distribution correction. (arXiv:2207.11640v3 [stat.ML] UPDATED)
    Bayesian inference for high-dimensional inverse problems is computationally costly and requires selecting a suitable prior distribution. Amortized variational inference addresses these challenges via a neural network that approximates the posterior distribution not only for one instance of data, but a distribution of data pertaining to a specific inverse problem. During inference, the neural network -- in our case a conditional normalizing flow -- provides posterior samples at virtually no cost. However, the accuracy of amortized variational inference relies on the availability of high-fidelity training data, which seldom exists in geophysical inverse problems due to the Earth's heterogeneity. In addition, the network is prone to errors if evaluated over out-of-distribution data. As such, we propose to increase the resilience of amortized variational inference in the presence of moderate data distribution shifts. We achieve this via a correction to the latent distribution that improves the posterior distribution approximation for the data at hand. The correction involves relaxing the standard Gaussian assumption on the latent distribution and parameterizing it via a Gaussian distribution with an unknown mean and (diagonal) covariance. These unknowns are then estimated by minimizing the Kullback-Leibler divergence between the corrected and the (physics-based) true posterior distributions. While generic and applicable to other inverse problems, by means of a linearized seismic imaging example, we show that our correction step improves the robustness of amortized variational inference with respect to changes in the number of seismic sources, noise variance, and shifts in the prior distribution. This approach provides a seismic image with limited artifacts and an assessment of its uncertainty at approximately the same cost as five reverse-time migrations.  ( 2 min )
    Failure Tolerant Training with Persistent Memory Disaggregation over CXL. (arXiv:2301.07492v1 [cs.AR])
    This paper proposes TRAININGCXL that can efficiently process large-scale recommendation datasets in the pool of disaggregated memory while making training fault tolerant with low overhead. To this end, i) we integrate persistent memory (PMEM) and GPU into a cache-coherent domain as Type-2. Enabling CXL allows PMEM to be directly placed in GPU's memory hierarchy, such that GPU can access PMEM without software intervention. TRAININGCXL introduces computing and checkpointing logic near the CXL controller, thereby training data and managing persistency in an active manner. Considering PMEM's vulnerability, ii) we utilize the unique characteristics of recommendation models and take the checkpointing overhead off the critical path of their training. Lastly, iii) TRAININGCXL employs an advanced checkpointing technique that relaxes the updating sequence of model parameters and embeddings across training batches. The evaluation shows that TRAININGCXL achieves 5.2x training performance improvement and 76% energy savings, compared to the modern PMEM-based recommendation systems.  ( 2 min )
    Compression of GPS Trajectories using Autoencoders. (arXiv:2301.07420v1 [cs.LG])
    The ubiquitous availability of mobile devices capable of location tracking led to a significant rise in the collection of GPS data. Several compression methods have been developed in order to reduce the amount of storage needed while keeping the important information. In this paper, we present an lstm-autoencoder based approach in order to compress and reconstruct GPS trajectories, which is evaluated on both a gaming and real-world dataset. We consider various compression ratios and trajectory lengths. The performance is compared to other trajectory compression algorithms, i.e., Douglas-Peucker. Overall, the results indicate that our approach outperforms Douglas-Peucker significantly in terms of the discrete Fr\'echet distance and dynamic time warping. Furthermore, by reconstructing every point lossy, the proposed methodology offers multiple advantages over traditional methods.  ( 2 min )
    No-substitution k-means Clustering with Adversarial Order. (arXiv:2012.14512v2 [cs.DS] UPDATED)
    We investigate $k$-means clustering in the online no-substitution setting when the input arrives in \emph{arbitrary} order. In this setting, points arrive one after another, and the algorithm is required to instantly decide whether to take the current point as a center before observing the next point. Decisions are irrevocable. The goal is to minimize both the number of centers and the $k$-means cost. Previous works in this setting assume that the input's order is random, or that the input's aspect ratio is bounded. It is known that if the order is arbitrary and there is no assumption on the input, then any algorithm must take all points as centers. Moreover, assuming a bounded aspect ratio is too restrictive -- it does not include natural input generated from mixture models. We introduce a new complexity measure that quantifies the difficulty of clustering a dataset arriving in arbitrary order. We design a new random algorithm and prove that if applied on data with complexity $d$, the algorithm takes $O(d\log(n) k\log(k))$ centers and is an $O(k^3)$-approximation. We also prove that if the data is sampled from a ``natural" distribution, such as a mixture of $k$ Gaussians, then the new complexity measure is equal to $O(k^2\log(n))$. This implies that for data generated from those distributions, our new algorithm takes only $\text{poly}(k\log(n))$ centers and is a $\text{poly}(k)$-approximation. In terms of negative results, we prove that the number of centers needed to achieve an $\alpha$-approximation is at least $\Omega\left(\frac{d}{k\log(n\alpha)}\right)$.  ( 2 min )
    Quantum-inspired tensor network for Earth science. (arXiv:2301.07528v1 [physics.geo-ph])
    Deep Learning (DL) is one of many successful methodologies to extract informative patterns and insights from ever increasing noisy large-scale datasets (in our case, satellite images). However, DL models consist of a few thousand to millions of training parameters, and these training parameters require tremendous amount of electrical power for extracting informative patterns from noisy large-scale datasets (e.g., computationally expensive). Hence, we employ a quantum-inspired tensor network for compressing trainable parameters of physics-informed neural networks (PINNs) in Earth science. PINNs are DL models penalized by enforcing the law of physics; in particular, the law of physics is embedded in DL models. In addition, we apply tensor decomposition to HyperSpectral Images (HSIs) to improve their spectral resolution. A quantum-inspired tensor network is also the native formulation to efficiently represent and train quantum machine learning models on big datasets on GPU tensor cores. Furthermore, the key contribution of this paper is twofold: (I) we reduced a number of trainable parameters of PINNs by using a quantum-inspired tensor network, and (II) we improved the spectral resolution of remotely-sensed images by employing tensor decomposition. As a benchmark PDE, we solved Burger's equation. As practical satellite data, we employed HSIs of Indian Pine, USA and of Pavia University, Italy.  ( 2 min )
    Landscape Complexity for the Empirical Risk of Generalized Linear Models. (arXiv:1912.02143v5 [stat.ML] UPDATED)
    We present a method to obtain the average and the typical value of the number of critical points of the empirical risk landscape for generalized linear estimation problems and variants. This represents a substantial extension of previous applications of the Kac-Rice method since it allows to analyze the critical points of high dimensional non-Gaussian random functions. Under a technical hypothesis, we obtain a rigorous explicit variational formula for the annealed complexity, which is the logarithm of the average number of critical points at fixed value of the empirical risk. This result is simplified, and extended, using the non-rigorous Kac-Rice replicated method from theoretical physics. In this way we find an explicit variational formula for the quenched complexity, which is generally different from its annealed counterpart, and allows to obtain the number of critical points for typical instances up to exponential accuracy.  ( 2 min )
    LIMEADE: From AI Explanations to Advice Taking. (arXiv:2003.04315v5 [cs.IR] UPDATED)
    Research in human-centered AI has shown the benefits of systems that can explain their predictions. Methods that allow an AI to take advice from humans in response to explanations are similarly useful. While both capabilities are well-developed for transparent learning models (e.g., linear models and GA$^2$Ms), and recent techniques (e.g., LIME and SHAP) can generate explanations for opaque models, little attention has been given to advice methods for opaque models. This paper introduces LIMEADE, the first general framework that translates both positive and negative advice (expressed using high-level vocabulary such as that employed by post-hoc explanations) into an update to an arbitrary, underlying opaque model. We demonstrate the generality of our approach with case studies on seventy real-world models across two broad domains: image classification and text recommendation. We show our method improves accuracy compared to a rigorous baseline on the image classification domains. For the text modality, we apply our framework to a neural recommender system for scientific papers on a public website; our user study shows that our framework leads to significantly higher perceived user control, trust, and satisfaction.  ( 2 min )
    A Robust Classification Framework for Byzantine-Resilient Stochastic Gradient Descent. (arXiv:2301.07498v1 [cs.LG])
    This paper proposes a Robust Gradient Classification Framework (RGCF) for Byzantine fault tolerance in distributed stochastic gradient descent. The framework consists of a pattern recognition filter which we train to be able to classify individual gradients as Byzantine by using their direction alone. This filter is robust to an arbitrary number of Byzantine workers for convex as well as non-convex optimisation settings, which is a significant improvement on the prior work that is robust to Byzantine faults only when up to 50% of the workers are Byzantine. This solution does not require an estimate of the number of Byzantine workers; its running time is not dependent on the number of workers and can scale up to training instances with a large number of workers without a loss in performance. We validate our solution by training convolutional neural networks on the MNIST dataset in the presence of Byzantine workers.  ( 2 min )
    Autonomous Slalom Maneuver Based on Expert Drivers' Behavior Using Convolutional Neural Network. (arXiv:2301.07424v1 [cs.RO])
    Lane changing and obstacle avoidance are one of the most important tasks in automated cars. To date, many algorithms have been suggested that are generally based on path trajectory or reinforcement learning approaches. Although these methods have been efficient, they are not able to accurately imitate a smooth path traveled by an expert driver. In this paper, a method is presented to mimic drivers' behavior using a convolutional neural network (CNN). First, seven features are extracted from a dataset gathered from four expert drivers in a driving simulator. Then, these features are converted from 1D arrays to 2D arrays and injected into a CNN. The CNN model computes the desired steering wheel angle and sends it to an adaptive PD controller. Finally, the control unit applies proper torque to the steering wheel. Results show that the CNN model can mimic the drivers' behavior with an R2-squared of 0.83. Also, the performance of the presented method was evaluated in the driving simulator for 17 trials, which avoided all traffic cones successfully. In some trials, the presented method performed a smoother maneuver compared to the expert drivers.  ( 2 min )
    A Survey of Advanced Computer Vision Techniques for Sports. (arXiv:2301.07583v1 [cs.CV])
    Computer Vision developments are enabling significant advances in many fields, including sports. Many applications built on top of Computer Vision technologies, such as tracking data, are nowadays essential for every top-level analyst, coach, and even player. In this paper, we survey Computer Vision techniques that can help many sports-related studies gather vast amounts of data, such as Object Detection and Pose Estimation. We provide a use case for such data: building a model for shot speed estimation with pose data obtained using only Computer Vision models. Our model achieves a correlation of 67%. The possibility of estimating shot speeds enables much deeper studies about enabling the creation of new metrics and recommendation systems that will help athletes improve their performance, in any sport. The proposed methodology is easily replicable for many technical movements and is only limited by the availability of video data.  ( 2 min )
    Physics-informed Information Field Theory for Modeling Physical Systems with Uncertainty Quantification. (arXiv:2301.07609v1 [stat.ML])
    Data-driven approaches coupled with physical knowledge are powerful techniques to model systems. The goal of such models is to efficiently solve for the underlying field by combining measurements with known physical laws. As many systems contain unknown elements, such as missing parameters, noisy data, or incomplete physical laws, this is widely approached as an uncertainty quantification problem. The common techniques to handle all the variables typically depend on the numerical scheme used to approximate the posterior, and it is desirable to have a method which is independent of any such discretization. Information field theory (IFT) provides the tools necessary to perform statistics over fields that are not necessarily Gaussian. We extend IFT to physics-informed IFT (PIFT) by encoding the functional priors with information about the physical laws which describe the field. The posteriors derived from this PIFT remain independent of any numerical scheme and can capture multiple modes, allowing for the solution of problems which are ill-posed. We demonstrate our approach through an analytical example involving the Klein-Gordon equation. We then develop a variant of stochastic gradient Langevin dynamics to draw samples from the joint posterior over the field and model parameters. We apply our method to numerical examples with various degrees of model-form error and to inverse problems involving nonlinear differential equations. As an addendum, the method is equipped with a metric which allows the posterior to automatically quantify model-form uncertainty. Because of this, our numerical experiments show that the method remains robust to even an incorrect representation of the physics given sufficient data. We numerically demonstrate that the method correctly identifies when the physics cannot be trusted, in which case it automatically treats learning the field as a regression problem.  ( 2 min )
    Adaptively Integrated Knowledge Distillation and Prediction Uncertainty for Continual Learning. (arXiv:2301.07316v1 [cs.CV])
    Current deep learning models often suffer from catastrophic forgetting of old knowledge when continually learning new knowledge. Existing strategies to alleviate this issue often fix the trade-off between keeping old knowledge (stability) and learning new knowledge (plasticity). However, the stability-plasticity trade-off during continual learning may need to be dynamically changed for better model performance. In this paper, we propose two novel ways to adaptively balance model stability and plasticity. The first one is to adaptively integrate multiple levels of old knowledge and transfer it to each block level in the new model. The second one uses prediction uncertainty of old knowledge to naturally tune the importance of learning new knowledge during model training. To our best knowledge, this is the first time to connect model prediction uncertainty and knowledge distillation for continual learning. In addition, this paper applies a modified CutMix particularly to augment the data for old knowledge, further alleviating the catastrophic forgetting issue. Extensive evaluations on the CIFAR100 and the ImageNet datasets confirmed the effectiveness of the proposed method for continual learning.  ( 2 min )
    AutoFraudNet: A Multimodal Network to Detect Fraud in the Auto Insurance Industry. (arXiv:2301.07526v1 [cs.LG])
    In the insurance industry detecting fraudulent claims is a critical task with a significant financial impact. A common strategy to identify fraudulent claims is looking for inconsistencies in the supporting evidence. However, this is a laborious and cognitively heavy task for human experts as insurance claims typically come with a plethora of data from different modalities (e.g. images, text and metadata). To overcome this challenge, the research community has focused on multimodal machine learning frameworks that can efficiently reason through multiple data sources. Despite recent advances in multimodal learning, these frameworks still suffer from (i) challenges of joint-training caused by the different characteristics of different modalities and (ii) overfitting tendencies due to high model complexity. In this work, we address these challenges by introducing a multimodal reasoning framework, AutoFraudNet (Automobile Insurance Fraud Detection Network), for detecting fraudulent auto-insurance claims. AutoFraudNet utilizes a cascaded slow fusion framework and state-of-the-art fusion block, BLOCK Tucker, to alleviate the challenges of joint-training. Furthermore, it incorporates a light-weight architectural design along with additional losses to prevent overfitting. Through extensive experiments conducted on a real-world dataset, we demonstrate: (i) the merits of multimodal approaches, when compared to unimodal and bimodal methods, and (ii) the effectiveness of AutoFraudNet in fusing various modalities to boost performance (over 3\% in PR AUC).  ( 2 min )
    Training Semantic Segmentation on Heterogeneous Datasets. (arXiv:2301.07634v1 [cs.CV])
    We explore semantic segmentation beyond the conventional, single-dataset homogeneous training and bring forward the problem of Heterogeneous Training of Semantic Segmentation (HTSS). HTSS involves simultaneous training on multiple heterogeneous datasets, i.e. datasets with conflicting label spaces and different (weak) annotation types from the perspective of semantic segmentation. The HTSS formulation exposes deep networks to a larger and previously unexplored aggregation of information that can potentially enhance semantic segmentation in three directions: i) performance: increased segmentation metrics on seen datasets, ii) generalization: improved segmentation metrics on unseen datasets, and iii) knowledgeability: increased number of recognizable semantic concepts. To research these benefits of HTSS, we propose a unified framework, that incorporates heterogeneous datasets in a single-network training pipeline following the established FCN standard. Our framework first curates heterogeneous datasets to bring them into a common format and then trains a single-backbone FCN on all of them simultaneously. To achieve this, it transforms weak annotations, which are incompatible with semantic segmentation, to per-pixel labels, and hierarchizes their label spaces into a universal taxonomy. The trained HTSS models demonstrate performance and generalization gains over a wide range of datasets and extend the inference label space entailing hundreds of semantic classes.  ( 2 min )
    Curvilinear object segmentation in medical images based on ODoS filter and deep learning network. (arXiv:2301.07475v1 [eess.IV])
    Automatic segmentation of curvilinear objects in medical images plays an important role in the diagnosis and evaluation of human diseases, yet it is a challenging uncertainty for the complex segmentation task due to different issues like various image appearance, low contrast between curvilinear objects and their surrounding backgrounds, thin and uneven curvilinear structures, and improper background illumination. To overcome these challenges, we present a unique curvilinear structure segmentation framework based on oriented derivative of stick (ODoS) filter and deep learning network for curvilinear object segmentation in medical images. Currently, a large number of deep learning models emphasis on developing deep architectures and ignore capturing the structural features of curvature objects, which may lead to unsatisfactory results. In consequence, a new approach that incorporates the ODoS filter as part of a deep learning network is presented to improve the spatial attention of curvilinear objects. In which, the original image is considered as principal part to describe various image appearance and complex background illumination, the multi-step strategy is used to enhance contrast between curvilinear objects and their surrounding backgrounds, and the vector field is applied to discriminate thin and uneven curvilinear structures. Subsequently, a deep learning framework is employed to extract varvious structural features for curvilinear object segmentation in medical images. The performance of the computational model was validated in experiments with publicly available DRIVE, STARE and CHASEDB1 datasets. Experimental results indicate that the presented model has yielded surprising results compared with some state-of-the-art methods.  ( 2 min )
    Model-free machine learning of conservation laws from data. (arXiv:2301.07503v1 [cs.LG])
    We present a machine learning based method for learning first integrals of systems of ordinary differential equations from given trajectory data. The method is model-free in that it does not require explicit knowledge of the underlying system of differential equations that generated the trajectories. As a by-product, once the first integrals have been learned, also the system of differential equations will be known. We illustrate our method by considering several classical problems from the mathematical sciences.  ( 2 min )
    Machine learning techniques for the Schizophrenia diagnosis: A comprehensive review and future research directions. (arXiv:2301.07496v1 [cs.LG])
    Schizophrenia (SCZ) is a brain disorder where different people experience different symptoms, such as hallucination, delusion, flat-talk, disorganized thinking, etc. In the long term, this can cause severe effects and diminish life expectancy by more than ten years. Therefore, early and accurate diagnosis of SCZ is prevalent, and modalities like structural magnetic resonance imaging (sMRI), functional MRI (fMRI), diffusion tensor imaging (DTI), and electroencephalogram (EEG) assist in witnessing the brain abnormalities of the patients. Moreover, for accurate diagnosis of SCZ, researchers have used machine learning (ML) algorithms for the past decade to distinguish the brain patterns of healthy and SCZ brains using MRI and fMRI images. This paper seeks to acquaint SCZ researchers with ML and to discuss its recent applications to the field of SCZ study. This paper comprehensively reviews state-of-the-art techniques such as ML classifiers, artificial neural network (ANN), deep learning (DL) models, methodological fundamentals, and applications with previous studies. The motivation of this paper is to benefit from finding the research gaps that may lead to the development of a new model for accurate SCZ diagnosis. The paper concludes with the research finding, followed by the future scope that directly contributes to new research directions.  ( 2 min )
    Local Learning with Neuron Groups. (arXiv:2301.07635v1 [cs.LG])
    Traditional deep network training methods optimize a monolithic objective function jointly for all the components. This can lead to various inefficiencies in terms of potential parallelization. Local learning is an approach to model-parallelism that removes the standard end-to-end learning setup and utilizes local objective functions to permit parallel learning amongst model components in a deep network. Recent works have demonstrated that variants of local learning can lead to efficient training of modern deep networks. However, in terms of how much computation can be distributed, these approaches are typically limited by the number of layers in a network. In this work we propose to study how local learning can be applied at the level of splitting layers or modules into sub-components, adding a notion of width-wise modularity to the existing depth-wise modularity associated with local learning. We investigate local-learning penalties that permit such models to be trained efficiently. Our experiments on the CIFAR-10, CIFAR-100, and Imagenet32 datasets demonstrate that introducing width-level modularity can lead to computational advantages over existing methods based on local learning and opens new opportunities for improved model-parallel distributed training. Code is available at: https://github.com/adeetyapatel12/GN-DGL.  ( 2 min )
    Human-Timescale Adaptation in an Open-Ended Task Space. (arXiv:2301.07608v1 [cs.LG])
    Foundation models have shown impressive adaptation and scalability in supervised and self-supervised learning problems, but so far these successes have not fully translated to reinforcement learning (RL). In this work, we demonstrate that training an RL agent at scale leads to a general in-context learning algorithm that can adapt to open-ended novel embodied 3D problems as quickly as humans. In a vast space of held-out environment dynamics, our adaptive agent (AdA) displays on-the-fly hypothesis-driven exploration, efficient exploitation of acquired knowledge, and can successfully be prompted with first-person demonstrations. Adaptation emerges from three ingredients: (1) meta-reinforcement learning across a vast, smooth and diverse task distribution, (2) a policy parameterised as a large-scale attention-based memory architecture, and (3) an effective automated curriculum that prioritises tasks at the frontier of an agent's capabilities. We demonstrate characteristic scaling laws with respect to network size, memory length, and richness of the training task distribution. We believe our results lay the foundation for increasingly general and adaptive RL agents that perform well across ever-larger open-ended domains.  ( 2 min )
    Synthcity: facilitating innovative use cases of synthetic data in different data modalities. (arXiv:2301.07573v1 [cs.LG])
    Synthcity is an open-source software package for innovative use cases of synthetic data in ML fairness, privacy and augmentation across diverse tabular data modalities, including static data, regular and irregular time series, data with censoring, multi-source data, composite data, and more. Synthcity provides the practitioners with a single access point to cutting edge research and tools in synthetic data. It also offers the community a playground for rapid experimentation and prototyping, a one-stop-shop for SOTA benchmarks, and an opportunity for extending research impact. The library can be accessed on GitHub (https://github.com/vanderschaarlab/synthcity) and pip (https://pypi.org/project/synthcity/). We warmly invite the community to join the development effort by providing feedback, reporting bugs, and contributing code.  ( 2 min )
    Reslicing Ultrasound Images for Data Augmentation and Vessel Reconstruction. (arXiv:2301.07286v1 [eess.IV])
    Robot-guided catheter insertion has the potential to deliver urgent medical care in situations where medical personnel are unavailable. However, this technique requires accurate and reliable segmentation of anatomical landmarks in the body. For the ultrasound imaging modality, obtaining large amounts of training data for a segmentation model is time-consuming and expensive. This paper introduces RESUS (RESlicing of UltraSound Images), a weak supervision data augmentation technique for ultrasound images based on slicing reconstructed 3D volumes from tracked 2D images. This technique allows us to generate views which cannot be easily obtained in vivo due to physical constraints of ultrasound imaging, and use these augmented ultrasound images to train a semantic segmentation model. We demonstrate that RESUS achieves statistically significant improvement over training with non-augmented images and highlight qualitative improvements through vessel reconstruction.  ( 2 min )
    Learning a Formality-Aware Japanese Sentence Representation. (arXiv:2301.07209v1 [cs.CL])
    While the way intermediate representations are generated in encoder-decoder sequence-to-sequence models typically allow them to preserve the semantics of the input sentence, input features such as formality might be left out. On the other hand, downstream tasks such as translation would benefit from working with a sentence representation that preserves formality in addition to semantics, so as to generate sentences with the appropriate level of social formality -- the difference between speaking to a friend versus speaking with a supervisor. We propose a sequence-to-sequence method for learning a formality-aware representation for Japanese sentences, where sentence generation is conditioned on both the original representation of the input sentence, and a side constraint which guides the sentence representation towards preserving formality information. Additionally, we propose augmenting the sentence representation with a learned representation of formality which facilitates the extraction of formality in downstream tasks. We address the lack of formality-annotated parallel data by adapting previous works on procedural formality classification of Japanese sentences. Experimental results suggest that our techniques not only helps the decoder recover the formality of the input sentence, but also slightly improves the preservation of input sentence semantics.  ( 2 min )
    Efficient correlation-based discretization of continuous variables for annealing machines. (arXiv:2301.07244v1 [quant-ph])
    Annealing machines specialized for combinatorial optimization problems have been developed, and some companies offer services to use those machines. Such specialized machines can only handle binary variables, and their input format is the quadratic unconstrained binary optimization (QUBO) formulation. Therefore, discretization is necessary to solve problems with continuous variables. However, there is a severe constraint on the number of binary variables with such machines. Although the simple binary expansion in the previous research requires many binary variables, we need to reduce the number of such variables in the QUBO formulation due to the constraint. We propose a discretization method that involves using correlations of continuous variables. We numerically show that the proposed method reduces the number of necessary binary variables in the QUBO formulation without a significant loss in prediction accuracy.  ( 2 min )
    Discrete Latent Structure in Neural Networks. (arXiv:2301.07473v1 [cs.LG])
    Many types of data from fields including natural language processing, computer vision, and bioinformatics, are well represented by discrete, compositional structures such as trees, sequences, or matchings. Latent structure models are a powerful tool for learning to extract such representations, offering a way to incorporate structural bias, discover insight about the data, and interpret decisions. However, effective training is challenging, as neural networks are typically designed for continuous computation. This text explores three broad strategies for learning with discrete latent structure: continuous relaxation, surrogate gradients, and probabilistic estimation. Our presentation relies on consistent notations for a wide range of models. As such, we reveal many new connections between latent structure learning strategies, showing how most consist of the same small set of fundamental building blocks, but use them differently, leading to substantially different applicability and properties.  ( 2 min )
    DDPEN: Trajectory Optimisation With Sub Goal Generation Model. (arXiv:2301.07433v1 [cs.RO])
    Differential dynamic programming (DDP) is a widely used and powerful trajectory optimization technique, however, due to its internal structure, it is not exempt from local minima. In this paper, we present Differential Dynamic Programming with Escape Network (DDPEN) - a novel approach to avoid DDP local minima by utilising an additional term used in the optimization criteria pointing towards the direction where robot should move in order to escape local minima. In order to produce the aforementioned directions, we propose to utilize a deep model that takes as an input the map of the environment in the form of a costmap together with the desired goal position. The Model produces possible future directions that will lead to the goal, avoiding local minima which is possible to run in real time conditions. The model is trained on a synthetic dataset and overall the system is evaluated at the Gazebo simulator. In this work we show that our proposed method allows avoiding local minima of trajectory optimization algorithm and successfully execute a trajectory 278 m long with various convex and nonconvex obstacles.  ( 2 min )
    DIRECT: Learning from Sparse and Shifting Rewards using Discriminative Reward Co-Training. (arXiv:2301.07421v1 [cs.LG])
    We propose discriminative reward co-training (DIRECT) as an extension to deep reinforcement learning algorithms. Building upon the concept of self-imitation learning (SIL), we introduce an imitation buffer to store beneficial trajectories generated by the policy determined by their return. A discriminator network is trained concurrently to the policy to distinguish between trajectories generated by the current policy and beneficial trajectories generated by previous policies. The discriminator's verdict is used to construct a reward signal for optimizing the policy. By interpolating prior experience, DIRECT is able to act as a surrogate, steering policy optimization towards more valuable regions of the reward landscape thus learning an optimal policy. Our results show that DIRECT outperforms state-of-the-art algorithms in sparse- and shifting-reward environments being able to provide a surrogate reward to the policy and direct the optimization towards valuable areas.  ( 2 min )
    Image Embedding for Denoising Generative Models. (arXiv:2301.07485v1 [cs.CV])
    Denoising Diffusion models are gaining increasing popularity in the field of generative modeling for several reasons, including the simple and stable training, the excellent generative quality, and the solid probabilistic foundation. In this article, we address the problem of {\em embedding} an image into the latent space of Denoising Diffusion Models, that is finding a suitable ``noisy'' image whose denoising results in the original image. We particularly focus on Denoising Diffusion Implicit Models due to the deterministic nature of their reverse diffusion process. As a side result of our investigation, we gain a deeper insight into the structure of the latent space of diffusion models, opening interesting perspectives on its exploration, the definition of semantic trajectories, and the manipulation/conditioning of encodings for editing purposes. A particularly interesting property highlighted by our research, which is also characteristic of this class of generative models, is the independence of the latent representation from the networks implementing the reverse diffusion process. In other words, a common seed passed to different networks (each trained on the same dataset), eventually results in identical images.  ( 2 min )
    Optimistic Dynamic Regret Bounds. (arXiv:2301.07530v1 [cs.LG])
    Online Learning (OL) algorithms have originally been developed to guarantee good performances when comparing their output to the best fixed strategy. The question of performance with respect to dynamic strategies remains an active research topic. We develop in this work dynamic adaptations of classical OL algorithms based on the use of experts' advice and the notion of optimism. We also propose a constructivist method to generate those advices and eventually provide both theoretical and experimental guarantees for our procedures.  ( 2 min )
    PIRLNav: Pretraining with Imitation and RL Finetuning for ObjectNav. (arXiv:2301.07302v1 [cs.LG])
    We study ObjectGoal Navigation - where a virtual robot situated in a new environment is asked to navigate to an object. Prior work has shown that imitation learning (IL) on a dataset of human demonstrations achieves promising results. However, this has limitations $-$ 1) IL policies generalize poorly to new states, since the training mimics actions not their consequences, and 2) collecting demonstrations is expensive. On the other hand, reinforcement learning (RL) is trivially scalable, but requires careful reward engineering to achieve desirable behavior. We present a two-stage learning scheme for IL pretraining on human demonstrations followed by RL-finetuning. This leads to a PIRLNav policy that advances the state-of-the-art on ObjectNav from $60.0\%$ success rate to $65.0\%$ ($+5.0\%$ absolute). Using this IL$\rightarrow$RL training recipe, we present a rigorous empirical analysis of design choices. First, we investigate whether human demonstrations can be replaced with `free' (automatically generated) sources of demonstrations, e.g. shortest paths (SP) or task-agnostic frontier exploration (FE) trajectories. We find that IL$\rightarrow$RL on human demonstrations outperforms IL$\rightarrow$RL on SP and FE trajectories, even when controlled for the same IL-pretraining success on TRAIN, and even on a subset of VAL episodes where IL-pretraining success favors the SP or FE policies. Next, we study how RL-finetuning performance scales with the size of the IL pretraining dataset. We find that as we increase the size of the IL-pretraining dataset and get to high IL accuracies, the improvements from RL-finetuning are smaller, and that $90\%$ of the performance of our best IL$\rightarrow$RL policy can be achieved with less than half the number of IL demonstrations. Finally, we analyze failure modes of our ObjectNav policies, and present guidelines for further improving them.  ( 2 min )
    Beating the Best: Improving on AlphaFold2 at Protein Structure Prediction. (arXiv:2301.07568v1 [q-bio.BM])
    The goal of Protein Structure Prediction (PSP) problem is to predict a protein's 3D structure (confirmation) from its amino acid sequence. The problem has been a 'holy grail' of science since the Noble prize-winning work of Anfinsen demonstrated that protein conformation was determined by sequence. A recent and important step towards this goal was the development of AlphaFold2, currently the best PSP method. AlphaFold2 is probably the highest profile application of AI to science. Both AlphaFold2 and RoseTTAFold (another impressive PSP method) have been published and placed in the public domain (code & models). Stacking is a form of ensemble machine learning ML in which multiple baseline models are first learnt, then a meta-model is learnt using the outputs of the baseline level model to form a model that outperforms the base models. Stacking has been successful in many applications. We developed the ARStack PSP method by stacking AlphaFold2 and RoseTTAFold. ARStack significantly outperforms AlphaFold2. We rigorously demonstrate this using two sets of non-homologous proteins, and a test set of protein structures published after that of AlphaFold2 and RoseTTAFold. As more high quality prediction methods are published it is likely that ensemble methods will increasingly outperform any single method.  ( 2 min )
    Adversarial Robust Deep Reinforcement Learning Requires Redefining Robustness. (arXiv:2301.07487v1 [cs.LG])
    Learning from raw high dimensional data via interaction with a given environment has been effectively achieved through the utilization of deep neural networks. Yet the observed degradation in policy performance caused by imperceptible worst-case policy dependent translations along high sensitivity directions (i.e. adversarial perturbations) raises concerns on the robustness of deep reinforcement learning policies. In our paper, we show that these high sensitivity directions do not lie only along particular worst-case directions, but rather are more abundant in the deep neural policy landscape and can be found via more natural means in a black-box setting. Furthermore, we show that vanilla training techniques intriguingly result in learning more robust policies compared to the policies learnt via the state-of-the-art adversarial training techniques. We believe our work lays out intriguing properties of the deep reinforcement learning policy manifold and our results can help to build robust and generalizable deep reinforcement learning policies.  ( 2 min )
    Learning Deformation Trajectories of Boltzmann Densities. (arXiv:2301.07388v1 [stat.ML])
    We introduce a training objective for continuous normalizing flows that can be used in the absence of samples but in the presence of an energy function. Our method relies on either a prescribed or a learnt interpolation $f_t$ of energy functions between the target energy $f_1$ and the energy function of a generalized Gaussian $f_0(x) = (|x|/\sigma)^p$. This then induces an interpolation of Boltzmann densities $p_t \propto e^{-f_t}$ and we aim to find a time-dependent vector field $V_t$ that transports samples along this family of densities. Concretely, this condition can be translated to a PDE between $V_t$ and $f_t$ and we minimize the amount by which this PDE fails to hold. We compare this objective to the reverse KL-divergence on Gaussian mixtures and on the $\phi^4$ lattice field theory on a circle.  ( 2 min )
    Threats, Vulnerabilities, and Controls of Machine Learning Based Systems: A Survey and Taxonomy. (arXiv:2301.07474v1 [cs.CR])
    In this article, we propose the Artificial Intelligence Security Taxonomy to systematize the knowledge of threats, vulnerabilities, and security controls of ML-based systems. We first classify the damage caused by attacks against ML-based systems, define ML-specific security, and discuss its characteristics. Next, we enumerate all relevant assets and stakeholders and provide a general taxonomy for ML-specific threats. Then, we collect a wide range of security controls against ML-specific threats through an extensive review of recent literature. Finally, we classify the vulnerabilities and controls of an ML-based system in terms of each vulnerable asset in the system's entire lifecycle.  ( 2 min )
    PTA-Det: Point Transformer Associating Point cloud and Image for 3D Object Detection. (arXiv:2301.07301v1 [cs.CV])
    In autonomous driving, 3D object detection based on multi-modal data has become an indispensable approach when facing complex environments around the vehicle. During multi-modal detection, LiDAR and camera are simultaneously applied for capturing and modeling. However, due to the intrinsic discrepancies between the LiDAR point and camera image, the fusion of the data for object detection encounters a series of problems. Most multi-modal detection methods perform even worse than LiDAR-only methods. In this investigation, we propose a method named PTA-Det to improve the performance of multi-modal detection. Accompanied by PTA-Det, a Pseudo Point Cloud Generation Network is proposed, which can convert image information including texture and semantic features by pseudo points. Thereafter, through a transformer-based Point Fusion Transition (PFT) module, the features of LiDAR points and pseudo points from image can be deeply fused under a unified point-based representation. The combination of these modules can conquer the major obstacle in feature fusion across modalities and realizes a complementary and discriminative representation for proposal generation. Extensive experiments on the KITTI dataset show the PTA-Det achieves a competitive result and support its effectiveness.  ( 2 min )
    Causal Falsification of Digital Twins. (arXiv:2301.07210v1 [stat.ME])
    Digital twins hold substantial promise in many applications, but rigorous procedures for assessing their accuracy are essential for their widespread deployment in safety-critical settings. By formulating this task within the framework of causal inference, we show it is not possible to certify that a twin is "correct" using real-world observational data unless potentially tenuous assumptions are made about the data-generating process. To avoid these assumptions, we propose an assessment strategy that instead aims to find cases where the twin is not correct, and present a general-purpose statistical procedure for doing so that may be used across a wide variety of applications and twin models. Our approach yields reliable and actionable information about the twin under only the assumption of an i.i.d. dataset of real-world observations, and in particular remains sound even in the presence of arbitrary unmeasured confounding. We demonstrate the effectiveness of our methodology via a large-scale case study involving sepsis modelling within the Pulse Physiology Engine, which we assess using the MIMIC-III dataset of ICU patients.  ( 2 min )
    Improve Noise Tolerance of Robust Loss via Noise-Awareness. (arXiv:2301.07306v1 [cs.LG])
    Robust loss minimization is an important strategy for handling robust learning issue on noisy labels. Current robust losses, however, inevitably involve hyperparameters to be tuned for different datasets with noisy labels, manually or heuristically through cross validation, which makes them fairly hard to be generally applied in practice. Existing robust loss methods usually assume that all training samples share common hyperparameters, which are independent of instances. This limits the ability of these methods on distinguishing individual noise properties of different samples, making them hardly adapt to different noise structures. To address above issues, we propose to assemble robust loss with instance-dependent hyperparameters to improve their noise-tolerance with theoretical guarantee. To achieve setting such instance-dependent hyperparameters for robust loss, we propose a meta-learning method capable of adaptively learning a hyperparameter prediction function, called Noise-Aware-Robust-Loss-Adjuster (NARL-Adjuster). Specifically, through mutual amelioration between hyperparameter prediction function and classifier parameters in our method, both of them can be simultaneously finely ameliorated and coordinated to attain solutions with good generalization capability. Four kinds of SOTA robust losses are attempted to be integrated with our algorithm, and experiments substantiate the general availability and effectiveness of the proposed method in both its noise tolerance and generalization performance. Meanwhile, the explicit parameterized structure makes the meta-learned prediction function capable of being readily transferrable and plug-and-play to unseen datasets with noisy labels. Specifically, we transfer our meta-learned NARL-Adjuster to unseen tasks, including several real noisy datasets, and achieve better performance compared with conventional hyperparameter tuning strategy.  ( 2 min )
    Tailor: Altering Skip Connections for Resource-Efficient Inference. (arXiv:2301.07247v1 [cs.CV])
    Deep neural networks use skip connections to improve training convergence. However, these skip connections are costly in hardware, requiring extra buffers and increasing on- and off-chip memory utilization and bandwidth requirements. In this paper, we show that skip connections can be optimized for hardware when tackled with a hardware-software codesign approach. We argue that while a network's skip connections are needed for the network to learn, they can later be removed or shortened to provide a more hardware efficient implementation with minimal to no accuracy loss. We introduce Tailor, a codesign tool whose hardware-aware training algorithm gradually removes or shortens a fully trained network's skip connections to lower their hardware cost. The optimized hardware designs improve resource utilization by up to 34% for BRAMs, 13% for FFs, and 16% for LUTs.  ( 2 min )
    Detecting and Ranking Causal Anomalies in End-to-End Complex System. (arXiv:2301.07281v1 [cs.LG])
    With the rapid development of technology, the automated monitoring systems of large-scale factories are becoming more and more important. By collecting a large amount of machine sensor data, we can have many ways to find anomalies. We believe that the real core value of an automated monitoring system is to identify and track the cause of the problem. The most famous method for finding causal anomalies is RCA, but there are many problems that cannot be ignored. They used the AutoRegressive eXogenous (ARX) model to create a time-invariant correlation network as a machine profile, and then use this profile to track the causal anomalies by means of a method called fault propagation. There are two major problems in describing the behavior of a machine by using the correlation network established by ARX: (1) It does not take into account the diversity of states (2) It does not separately consider the correlations with different time-lag. Based on these problems, we propose a framework called Ranking Causal Anomalies in End-to-End System (RCAE2E), which completely solves the problems mentioned above. In the experimental part, we use synthetic data and real-world large-scale photoelectric factory data to verify the correctness and existence of our method hypothesis.  ( 2 min )
    Towards Models that Can See and Read. (arXiv:2301.07389v1 [cs.CV])
    Visual Question Answering (VQA) and Image Captioning (CAP), which are among the most popular vision-language tasks, have analogous scene-text versions that require reasoning from the text in the image. Despite the obvious resemblance between them, the two are treated independently, yielding task-specific methods that can either see or read, but not both. In this work, we conduct an in-depth analysis of this phenomenon and propose UniTNT, a Unified Text-Non-Text approach, which grants existing multimodal architectures scene-text understanding capabilities. Specifically, we treat scene-text information as an additional modality, fusing it with any pretrained encoder-decoder-based architecture via designated modules. Thorough experiments reveal that UniTNT leads to the first single model that successfully handles both task types. Moreover, we show that scene-text understanding capabilities can boost vision-language models' performance on VQA and CAP by up to 3.49% and 0.7 CIDEr, respectively.  ( 2 min )
    Relativistic Digital Twin: Bringing the IoT to the Future. (arXiv:2301.07390v1 [cs.NI])
    Complex IoT ecosystems often require the usage of Digital Twins (DTs) of their physical assets in order to perform predictive analytics and simulate what-if scenarios. DTs are able to replicate IoT devices and adapt over time to their behavioral changes. However, DTs in IoT are typically tailored to a specific use case, without the possibility to seamlessly adapt to different scenarios. Further, the fragmentation of IoT poses additional challenges on how to deploy DTs in heterogeneous scenarios characterized by the usage of multiple data formats and IoT network protocols. In this paper, we propose the Relativistic Digital Twin (RDT) framework, through which we automatically generate general purpose DTs of IoT entities and tune their behavioral models over time by constantly observing their real counterparts. The framework relies on the object representation via the Web of Things (WoT), to offer a standardized interface to each of the IoT devices as well as to their DTs. To this purpose, we extended the W3C WoT standard in order to encompass the concept of behavioral model and define it in the Thing Description (TD) through a new vocabulary. Finally, we evaluated the RDT framework over two disjoint use cases to assess its correctness and learning performance, i.e. the DT of a simulated smart home scenario with the capability of forecasting the indoor temperature, and the DT of a real-world drone with the capability of forecasting its trajectory in an outdoor scenario.  ( 2 min )
    Complexity Analysis of a Countable-armed Bandit Problem. (arXiv:2301.07243v1 [cs.LG])
    We consider a stochastic multi-armed bandit (MAB) problem motivated by ``large'' action spaces, and endowed with a population of arms containing exactly $K$ arm-types, each characterized by a distinct mean reward. The decision maker is oblivious to the statistical properties of reward distributions as well as the population-level distribution of different arm-types, and is precluded also from observing the type of an arm after play. We study the classical problem of minimizing the expected cumulative regret over a horizon of play $n$, and propose algorithms that achieve a rate-optimal finite-time instance-dependent regret of $\mathcal{O}\left( \log n \right)$. We also show that the instance-independent (minimax) regret is $\tilde{\mathcal{O}}\left( \sqrt{n} \right)$ when $K=2$. While the order of regret and complexity of the problem suggests a great degree of similarity to the classical MAB problem, properties of the performance bounds and salient aspects of algorithm design are quite distinct from the latter, as are the key primitives that determine complexity along with the analysis tools needed to study them.  ( 2 min )
    Label Inference Attack against Split Learning under Regression Setting. (arXiv:2301.07284v1 [cs.CR])
    As a crucial building block in vertical Federated Learning (vFL), Split Learning (SL) has demonstrated its practice in the two-party model training collaboration, where one party holds the features of data samples and another party holds the corresponding labels. Such method is claimed to be private considering the shared information is only the embedding vectors and gradients instead of private raw data and labels. However, some recent works have shown that the private labels could be leaked by the gradients. These existing attack only works under the classification setting where the private labels are discrete. In this work, we step further to study the leakage in the scenario of the regression model, where the private labels are continuous numbers (instead of discrete labels in classification). This makes previous attacks harder to infer the continuous labels due to the unbounded output range. To address the limitation, we propose a novel learning-based attack that integrates gradient information and extra learning regularization objectives in aspects of model training properties, which can infer the labels under regression settings effectively. The comprehensive experiments on various datasets and models have demonstrated the effectiveness of our proposed attack. We hope our work can pave the way for future analyses that make the vFL framework more secure.  ( 2 min )
    Adapting Multilingual Speech Representation Model for a New, Underresourced Language through Multilingual Fine-tuning and Continued Pretraining. (arXiv:2301.07295v1 [cs.CL])
    In recent years, neural models learned through self-supervised pretraining on large scale multilingual text or speech data have exhibited promising results for underresourced languages, especially when a relatively large amount of data from related language(s) is available. While the technology has a potential for facilitating tasks carried out in language documentation projects, such as speech transcription, pretraining a multilingual model from scratch for every new language would be highly impractical. We investigate the possibility for adapting an existing multilingual wav2vec 2.0 model for a new language, focusing on actual fieldwork data from a critically endangered tongue: Ainu. Specifically, we (i) examine the feasibility of leveraging data from similar languages also in fine-tuning; (ii) verify whether the model's performance can be improved by further pretraining on target language data. Our results show that continued pretraining is the most effective method to adapt a wav2vec 2.0 model for a new language and leads to considerable reduction in error rates. Furthermore, we find that if a model pretrained on a related speech variety or an unrelated language with similar phonological characteristics is available, multilingual fine-tuning using additional data from that language can have positive impact on speech recognition performance when there is very little labeled data in the target language.  ( 2 min )
    A variational autoencoder-based nonnegative matrix factorisation model for deep dictionary learning. (arXiv:2301.07272v1 [cs.LG])
    Construction of dictionaries using nonnegative matrix factorisation (NMF) has extensive applications in signal processing and machine learning. With the advances in deep learning, training compact and robust dictionaries using deep neural networks, i.e., dictionaries of deep features, has been proposed. In this study, we propose a probabilistic generative model which employs a variational autoencoder (VAE) to perform nonnegative dictionary learning. In contrast to the existing VAE models, we cast the model under a statistical framework with latent variables obeying a Gamma distribution and design a new loss function to guarantee the nonnegative dictionaries. We adopt an acceptance-rejection sampling reparameterization trick to update the latent variables iteratively. We apply the dictionaries learned from VAE-NMF to two signal processing tasks, i.e., enhancement of speech and extraction of muscle synergies. Experimental results demonstrate that VAE-NMF performs better in learning the latent nonnegative dictionaries in comparison with state-of-the-art methods.  ( 2 min )
    Tracking Brand-Associated Polarity-Bearing Topics in User Reviews. (arXiv:2301.07183v1 [cs.IR])
    Monitoring online customer reviews is important for business organisations to measure customer satisfaction and better manage their reputations. In this paper, we propose a novel dynamic Brand-Topic Model (dBTM) which is able to automatically detect and track brand-associated sentiment scores and polarity-bearing topics from product reviews organised in temporally-ordered time intervals. dBTM models the evolution of the latent brand polarity scores and the topic-word distributions over time by Gaussian state space models. It also incorporates a meta learning strategy to control the update of the topic-word distribution in each time interval in order to ensure smooth topic transitions and better brand score predictions. It has been evaluated on a dataset constructed from MakeupAlley reviews and a hotel review dataset. Experimental results show that dBTM outperforms a number of competitive baselines in brand ranking, achieving a good balance of topic coherence and uniqueness, and extracting well-separated polarity-bearing topics across time intervals.  ( 2 min )
    Dual-sPLS: a family of Dual Sparse Partial Least Squares regressions for feature selection and prediction with tunable sparsity; evaluation on simulated and near-infrared (NIR) data. (arXiv:2301.07206v1 [stat.ML])
    Relating a set of variables X to a response y is crucial in chemometrics. A quantitative prediction objective can be enriched by qualitative data interpretation, for instance by locating the most influential features. When high-dimensional problems arise, dimension reduction techniques can be used. Most notable are projections (e.g. Partial Least Squares or PLS ) or variable selections (e.g. lasso). Sparse partial least squares combine both strategies, by blending variable selection into PLS. The variant presented in this paper, Dual-sPLS, generalizes the classical PLS1 algorithm. It provides balance between accurate prediction and efficient interpretation. It is based on penalizations inspired by classical regression methods (lasso, group lasso, least squares, ridge) and uses the dual norm notion. The resulting sparsity is enforced by an intuitive shrinking ratio parameter. Dual-sPLS favorably compares to similar regression methods, on simulated and real chemical data. Code is provided as an open-source package in R: \url{https://CRAN.R-project.org/package=dual.spls}.  ( 2 min )
    Scaffold-Based Multi-Objective Drug Candidate Optimization. (arXiv:2301.07175v1 [q-bio.BM])
    Multiparameter optimization (MPO) provides a means to assess and balance several variables based on their importance to the overall objective. However, using MPO methods in therapeutic discovery is challenging due to the number of cheminformatics properties required to find an optimal solution. High throughput virtual screening to identify hit candidates produces a large amount of data with conflicting properties. For instance, toxicity and binding affinity can contradict each other and cause improbable levels of toxicity that can lead to adverse effects. Instead of using the exhaustive method of treating each property, multiple properties can be combined into a single MPO score, with weights assigned for each property. This desirability score also lends itself well to ML applications that can use the score in the loss function. In this work, we will discuss scaffold focused graph-based Markov chain monte carlo framework built to generate molecules with optimal properties. This framework trains itself on-the-fly with the MPO score of each iteration of molecules, and is able to work on a greater number of properties and sample the chemical space around a starting scaffold. Results are compared to the chemical Transformer model molGCT to judge performance between graph and natural language processing approaches.  ( 2 min )
    Artificial Neuronal Ensembles with Learned Context Dependent Gating. (arXiv:2301.07187v1 [cs.LG])
    Biological neural networks are capable of recruiting different sets of neurons to encode different memories. However, when training artificial neural networks on a set of tasks, typically, no mechanism is employed for selectively producing anything analogous to these neuronal ensembles. Further, artificial neural networks suffer from catastrophic forgetting, where the network's performance rapidly deteriorates as tasks are learned sequentially. By contrast, sequential learning is possible for a range of biological organisms. We introduce Learned Context Dependent Gating (LXDG), a method to flexibly allocate and recall `artificial neuronal ensembles', using a particular network structure and a new set of regularization terms. Activities in the hidden layers of the network are modulated by gates, which are dynamically produced during training. The gates are outputs of networks themselves, trained with a sigmoid output activation. The regularization terms we have introduced correspond to properties exhibited by biological neuronal ensembles. The first term penalizes low gate sparsity, ensuring that only a specified fraction of the network is used. The second term ensures that previously learned gates are recalled when the network is presented with input from previously learned tasks. Finally, there is a regularization term responsible for ensuring that new tasks are encoded in gates that are as orthogonal as possible from previously used ones. We demonstrate the ability of this method to alleviate catastrophic forgetting on continual learning benchmarks. When the new regularization terms are included in the model along with Elastic Weight Consolidation (EWC) it achieves better performance on the benchmark `permuted MNIST' than with EWC alone. The benchmark `rotated MNIST' demonstrates how similar tasks recruit similar neurons to the artificial neuronal ensemble.  ( 2 min )
    Revisiting mass-radius relationships for exoplanet populations: a machine learning insight. (arXiv:2301.07143v1 [astro-ph.EP])
    The growing number of exoplanet discoveries and advances in machine learning techniques allow us to find, explore, and understand characteristics of these new worlds beyond our Solar System. We analyze the dataset of 762 confirmed exoplanets and eight Solar System planets using efficient machine-learning approaches to characterize their fundamental quantities. By adopting different unsupervised clustering algorithms, the data are divided into two main classes: planets with $\log R_{p}\leq0.91R_{\oplus}$ and $\log M_{p}\leq1.72M_{\oplus}$ as class 1 and those with $\log R_{p}>0.91R_{\oplus}$ and $\log M_{p}>1.72M_{\oplus}$ as class 2. Various regression models are used to reveal correlations between physical parameters and evaluate their performance. We find that planetary mass, orbital period, and stellar mass play preponderant roles in predicting exoplanet radius. The validation metrics (RMSE, MAE, and $R^{2}$) suggest that the Support Vector Regression has, by and large, better performance than other models and is a promising model for obtaining planetary radius. Not only do we improve the prediction accuracy in logarithmic space, but also we derive parametric equations using the M5P and Markov Chain Monte Carlo methods. Planets of class 1 are shown to be consistent with a positive linear mass-radius relation, while for planets of class 2, the planetary radius represents a strong correlation with their host stars' masses.  ( 2 min )
    Heterogeneous Multi-Robot Reinforcement Learning. (arXiv:2301.07137v1 [cs.RO])
    Cooperative multi-robot tasks can benefit from heterogeneity in the robots' physical and behavioral traits. In spite of this, traditional Multi-Agent Reinforcement Learning (MARL) frameworks lack the ability to explicitly accommodate policy heterogeneity, and typically constrain agents to share neural network parameters. This enforced homogeneity limits application in cases where the tasks benefit from heterogeneous behaviors. In this paper, we crystallize the role of heterogeneity in MARL policies. Towards this end, we introduce Heterogeneous Graph Neural Network Proximal Policy Optimization (HetGPPO), a paradigm for training heterogeneous MARL policies that leverages a Graph Neural Network for differentiable inter-agent communication. HetGPPO allows communicating agents to learn heterogeneous behaviors while enabling fully decentralized training in partially observable environments. We complement this with a taxonomical overview that exposes more heterogeneity classes than previously identified. To motivate the need for our model, we present a characterization of techniques that homogeneous models can leverage to emulate heterogeneous behavior, and show how this "apparent heterogeneity" is brittle in real-world conditions. Through simulations and real-world experiments, we show that: (i) when homogeneous methods fail due to strong heterogeneous requirements, HetGPPO succeeds, and, (ii) when homogeneous methods are able to learn apparently heterogeneous behaviors, HetGPPO achieves higher resilience to both training and deployment noise.  ( 2 min )
    Mortality Prediction with Adaptive Feature Importance Recalibration for Peritoneal Dialysis Patients: a deep-learning-based study on a real-world longitudinal follow-up dataset. (arXiv:2301.07107v1 [cs.LG])
    Objective: Peritoneal Dialysis (PD) is one of the most widely used life-supporting therapies for patients with End-Stage Renal Disease (ESRD). Predicting mortality risk and identifying modifiable risk factors based on the Electronic Medical Records (EMR) collected along with the follow-up visits are of great importance for personalized medicine and early intervention. Here, our objective is to develop a deep learning model for a real-time, individualized, and interpretable mortality prediction model - AICare. Method and Materials: Our proposed model consists of a multi-channel feature extraction module and an adaptive feature importance recalibration module. AICare explicitly identifies the key features that strongly indicate the outcome prediction for each patient to build the health status embedding individually. This study has collected 13,091 clinical follow-up visits and demographic data of 656 PD patients. To verify the application universality, this study has also collected 4,789 visits of 1,363 hemodialysis dialysis (HD) as an additional experiment dataset to test the prediction performance, which will be discussed in the Appendix. Results: 1) Experiment results show that AICare achieves 81.6%/74.3% AUROC and 47.2%/32.5% AUPRC for the 1-year mortality prediction task on PD/HD dataset respectively, which outperforms the state-of-the-art comparative deep learning models. 2) This study first provides a comprehensive elucidation of the relationship between the causes of mortality in patients with PD and clinical features based on an end-to-end deep learning model. 3) This study first reveals the pattern of variation in the importance of each feature in the mortality prediction based on built-in interpretability. 4) We develop a practical AI-Doctor interaction system to visualize the trajectory of patients' health status and risk indicators.  ( 3 min )
    Genetic Imitation Learning by Reward Extrapolation. (arXiv:2301.07182v1 [cs.NE])
    Imitation learning demonstrates remarkable performance in various domains. However, imitation learning is also constrained by many prerequisites. The research community has done intensive research to alleviate these constraints, such as adding the stochastic policy to avoid unseen states, eliminating the need for action labels, and learning from the suboptimal demonstrations. Inspired by the natural reproduction process, we proposed a method called GenIL that integrates the Genetic Algorithm with imitation learning. The involvement of the Genetic Algorithm improves the data efficiency by reproducing trajectories with various returns and assists the model in estimating more accurate and compact reward function parameters. We tested GenIL in both Atari and Mujoco domains, and the result shows that it successfully outperforms the previous extrapolation methods over extrapolation accuracy, robustness, and overall policy performance when input data is limited.  ( 2 min )
    A Combinatorial Semi-Bandit Approach to Charging Station Selection for Electric Vehicles. (arXiv:2301.07156v1 [cs.LG])
    In this work, we address the problem of long-distance navigation for battery electric vehicles (BEVs), where one or more charging sessions are required to reach the intended destination. We consider the availability and performance of the charging stations to be unknown and stochastic, and develop a combinatorial semi-bandit framework for exploring the road network to learn the parameters of the queue time and charging power distributions. Within this framework, we first outline a pre-processing for the road network graph to handle the constrained combinatorial optimization problem in an efficient way. Then, for the pre-processed graph, we use a Bayesian approach to model the stochastic edge weights, utilizing conjugate priors for the one-parameter exponential and two-parameter gamma distributions, the latter of which is novel to multi-armed bandit literature. Finally, we apply combinatorial versions of Thompson Sampling, BayesUCB and Epsilon-greedy to the problem. We demonstrate the performance of our framework on long-distance navigation problem instances in country-sized road networks, with simulation experiments in Norway, Sweden and Finland.  ( 2 min )
    Large Deviations for Classification Performance Analysis of Machine Learning Systems. (arXiv:2301.07104v1 [cs.LG])
    We study the performance of machine learning binary classification techniques in terms of error probabilities. The statistical test is based on the Data-Driven Decision Function (D3F), learned in the training phase, i.e., what is thresholded before the final binary decision is made. Based on large deviations theory, we show that under appropriate conditions the classification error probabilities vanish exponentially, as $\sim \exp\left(-n\,I + o(n) \right)$, where $I$ is the error rate and $n$ is the number of observations available for testing. We also propose two different approximations for the error probability curves, one based on a refined asymptotic formula (often referred to as exact asymptotics), and another one based on the central limit theorem. The theoretical findings are finally tested using the popular MNIST dataset.  ( 2 min )
    Continuous Trajectory Generation Based on Two-Stage GAN. (arXiv:2301.07103v1 [cs.LG])
    Simulating the human mobility and generating large-scale trajectories are of great use in many real-world applications, such as urban planning, epidemic spreading analysis, and geographic privacy protect. Although many previous works have studied the problem of trajectory generation, the continuity of the generated trajectories has been neglected, which makes these methods useless for practical urban simulation scenarios. To solve this problem, we propose a novel two-stage generative adversarial framework to generate the continuous trajectory on the road network, namely TS-TrajGen, which efficiently integrates prior domain knowledge of human mobility with model-free learning paradigm. Specifically, we build the generator under the human mobility hypothesis of the A* algorithm to learn the human mobility behavior. For the discriminator, we combine the sequential reward with the mobility yaw reward to enhance the effectiveness of the generator. Finally, we propose a novel two-stage generation process to overcome the weak point of the existing stochastic generation process. Extensive experiments on two real-world datasets and two case studies demonstrate that our framework yields significant improvements over the state-of-the-art methods.  ( 2 min )
    On Using Deep Learning Proxies as Forward Models in Deep Learning Problems. (arXiv:2301.07102v1 [cs.LG])
    Physics-based optimization problems are generally very time-consuming, especially due to the computational complexity associated with the forward model. Recent works have demonstrated that physics-modelling can be approximated with neural networks. However, there is always a certain degree of error associated with this learning, and we study this aspect in this paper. We demonstrate through experiments on popular mathematical benchmarks, that neural network approximations (NN-proxies) of such functions when plugged into the optimization framework, can lead to erroneous results. In particular, we study the behavior of particle swarm optimization and genetic algorithm methods and analyze their stability when coupled with NN-proxies. The correctness of the approximate model depends on the extent of sampling conducted in the parameter space, and through numerical experiments, we demonstrate that caution needs to be taken when constructing this landscape with neural networks. Further, the NN-proxies are hard to train for higher dimensional functions, and we present our insights for 4D and 10D problems. The error is higher for such cases, and we demonstrate that it is sensitive to the choice of the sampling scheme used to build the NN-proxy. The code is available at https://github.com/Fa-ti-ma/NN-proxy-in-optimization.  ( 2 min )
    The moral authority of ChatGPT. (arXiv:2301.07098v1 [cs.CY])
    ChatGPT is not only fun to chat with, but it also searches information, answers questions, and gives advice. With consistent moral advice, it might improve the moral judgment and decisions of users, who often hold contradictory moral beliefs. Unfortunately, ChatGPT turns out highly inconsistent as a moral advisor. Nonetheless, it influences users' moral judgment, we find in an experiment, even if they know they are advised by a chatting bot, and they underestimate how much they are influenced. Thus, ChatGPT threatens to corrupt rather than improves users' judgment. These findings raise the question of how to ensure the responsible use of ChatGPT and similar AI. Transparency is often touted but seems ineffective. We propose training to improve digital literacy.  ( 2 min )
    Distributed LSTM-Learning from Differentially Private Label Proportions. (arXiv:2301.07101v1 [cs.LG])
    Data privacy and decentralised data collection has become more and more popular in recent years. In order to solve issues with privacy, communication bandwidth and learning from spatio-temporal data, we will propose two efficient models which use Differential Privacy and decentralized LSTM-Learning: One, in which a Long Short Term Memory (LSTM) model is learned for extracting local temporal node constraints and feeding them into a Dense-Layer (LabelProportionToLocal). The other approach extends the first one by fetching histogram data from the neighbors and joining the information with the LSTM output (LabelProportionToDense). For evaluation two popular datasets are used: Pems-Bay and METR-LA. Additionally, we provide an own dataset, which is based on LuST. The evaluation will show the tradeoff between performance and data privacy.  ( 2 min )
    EENet: Learning to Early Exit for Adaptive Inference. (arXiv:2301.07099v1 [cs.LG])
    Budgeted adaptive inference with early exits is an emerging technique to improve the computational efficiency of deep neural networks (DNNs) for edge AI applications with limited resources at test time. This method leverages the fact that different test data samples may not require the same amount of computation for a correct prediction. By allowing early exiting from full layers of DNN inference for some test examples, we can reduce latency and improve throughput of edge inference while preserving performance. Although there have been numerous studies on designing specialized DNN architectures for training early-exit enabled DNN models, most of the existing work employ hand-tuned or manual rule-based early exit policies. In this study, we introduce a novel multi-exit DNN inference framework, coined as EENet, which leverages multi-objective learning to optimize the early exit policy for a trained multi-exit DNN under a given inference budget. This paper makes two novel contributions. First, we introduce the concept of early exit utility scores by combining diverse confidence measures with class-wise prediction scores to better estimate the correctness of test-time predictions at a given exit. Second, we train a lightweight, budget-driven, multi-objective neural network over validation predictions to learn the exit assignment scheduling for query examples at test time. The EENet early exit scheduler optimizes both the distribution of test samples to different exits and the selection of the exit utility thresholds such that the given inference budget is satisfied while the performance metric is maximized. Extensive experiments are conducted on five benchmarks, including three image datasets (CIFAR-10, CIFAR-100, ImageNet) and two NLP datasets (SST-2, AgNews). The results demonstrate the performance improvements of EENet compared to existing representative early exit techniques.  ( 2 min )
  • Open

    Data thinning for convolution-closed distributions. (arXiv:2301.07276v1 [stat.ME])
    We propose data thinning, a new approach for splitting an observation into two or more independent parts that sum to the original observation, and that follow the same distribution as the original observation, up to a (known) scaling of a parameter. This proposal is very general, and can be applied to any observation drawn from a "convolution closed" distribution, a class that includes the Gaussian, Poisson, negative binomial, Gamma, and binomial distributions, among others. It is similar in spirit to -- but distinct from, and more easily applicable than -- a recent proposal known as data fission. Data thinning has a number of applications to model selection, evaluation, and inference. For instance, cross-validation via data thinning provides an attractive alternative to the "usual" approach of cross-validation via sample splitting, especially in unsupervised settings in which the latter is not applicable. In simulations and in an application to single-cell RNA-sequencing data, we show that data thinning can be used to validate the results of unsupervised learning approaches, such as k-means clustering and principal components analysis.  ( 2 min )
    Optimal Sub-sampling to Boost Power of Kernel Sequential Change-point Detection. (arXiv:2210.15060v2 [stat.ME] UPDATED)
    We present a novel scheme to boost detection power for kernel maximum mean discrepancy based sequential change-point detection procedures. Our proposed scheme features an optimal sub-sampling of the history data before the detection procedure, in order to tackle the power loss incurred by the random sub-sample from the enormous history data. We apply our proposed scheme to both Scan $B$ and Kernel Cumulative Sum (CUSUM) procedures, and improved performance is observed from extensive numerical experiments.  ( 2 min )
    A Combinatorial Semi-Bandit Approach to Charging Station Selection for Electric Vehicles. (arXiv:2301.07156v1 [cs.LG])
    In this work, we address the problem of long-distance navigation for battery electric vehicles (BEVs), where one or more charging sessions are required to reach the intended destination. We consider the availability and performance of the charging stations to be unknown and stochastic, and develop a combinatorial semi-bandit framework for exploring the road network to learn the parameters of the queue time and charging power distributions. Within this framework, we first outline a pre-processing for the road network graph to handle the constrained combinatorial optimization problem in an efficient way. Then, for the pre-processed graph, we use a Bayesian approach to model the stochastic edge weights, utilizing conjugate priors for the one-parameter exponential and two-parameter gamma distributions, the latter of which is novel to multi-armed bandit literature. Finally, we apply combinatorial versions of Thompson Sampling, BayesUCB and Epsilon-greedy to the problem. We demonstrate the performance of our framework on long-distance navigation problem instances in country-sized road networks, with simulation experiments in Norway, Sweden and Finland.  ( 2 min )
    Learning Deformation Trajectories of Boltzmann Densities. (arXiv:2301.07388v1 [stat.ML])
    We introduce a training objective for continuous normalizing flows that can be used in the absence of samples but in the presence of an energy function. Our method relies on either a prescribed or a learnt interpolation $f_t$ of energy functions between the target energy $f_1$ and the energy function of a generalized Gaussian $f_0(x) = (|x|/\sigma)^p$. This then induces an interpolation of Boltzmann densities $p_t \propto e^{-f_t}$ and we aim to find a time-dependent vector field $V_t$ that transports samples along this family of densities. Concretely, this condition can be translated to a PDE between $V_t$ and $f_t$ and we minimize the amount by which this PDE fails to hold. We compare this objective to the reverse KL-divergence on Gaussian mixtures and on the $\phi^4$ lattice field theory on a circle.  ( 2 min )
    Reliable amortized variational inference with physics-based latent distribution correction. (arXiv:2207.11640v3 [stat.ML] UPDATED)
    Bayesian inference for high-dimensional inverse problems is computationally costly and requires selecting a suitable prior distribution. Amortized variational inference addresses these challenges via a neural network that approximates the posterior distribution not only for one instance of data, but a distribution of data pertaining to a specific inverse problem. During inference, the neural network -- in our case a conditional normalizing flow -- provides posterior samples at virtually no cost. However, the accuracy of amortized variational inference relies on the availability of high-fidelity training data, which seldom exists in geophysical inverse problems due to the Earth's heterogeneity. In addition, the network is prone to errors if evaluated over out-of-distribution data. As such, we propose to increase the resilience of amortized variational inference in the presence of moderate data distribution shifts. We achieve this via a correction to the latent distribution that improves the posterior distribution approximation for the data at hand. The correction involves relaxing the standard Gaussian assumption on the latent distribution and parameterizing it via a Gaussian distribution with an unknown mean and (diagonal) covariance. These unknowns are then estimated by minimizing the Kullback-Leibler divergence between the corrected and the (physics-based) true posterior distributions. While generic and applicable to other inverse problems, by means of a linearized seismic imaging example, we show that our correction step improves the robustness of amortized variational inference with respect to changes in the number of seismic sources, noise variance, and shifts in the prior distribution. This approach provides a seismic image with limited artifacts and an assessment of its uncertainty at approximately the same cost as five reverse-time migrations.  ( 2 min )
    A Nonsmooth Dynamical Systems Perspective on Accelerated Extensions of ADMM. (arXiv:1808.04048v7 [math.OC] UPDATED)
    Recently, there has been great interest in connections between continuous-time dynamical systems and optimization methods, notably in the context of accelerated methods for smooth and unconstrained problems. In this paper we extend this perspective to nonsmooth and constrained problems by obtaining differential inclusions associated to novel accelerated variants of the alternating direction method of multipliers (ADMM). Through a Lyapunov analysis, we derive rates of convergence for these dynamical systems in different settings that illustrate an interesting tradeoff between decaying versus constant damping strategies. We also obtain modified equations capturing fine-grained details of these methods, which have improved stability and preserve the leading order convergence rates. An extension to general nonlinear equality and inequality constraints in connection with singular perturbation theory is provided.  ( 2 min )
    Optimistic Dynamic Regret Bounds. (arXiv:2301.07530v1 [cs.LG])
    Online Learning (OL) algorithms have originally been developed to guarantee good performances when comparing their output to the best fixed strategy. The question of performance with respect to dynamic strategies remains an active research topic. We develop in this work dynamic adaptations of classical OL algorithms based on the use of experts' advice and the notion of optimism. We also propose a constructivist method to generate those advices and eventually provide both theoretical and experimental guarantees for our procedures.  ( 2 min )
    Functional Neural Networks: Shift invariant models for functional data with applications to EEG classification. (arXiv:2301.05869v1 [cs.LG] CROSS LISTED)
    It is desirable for statistical models to detect signals of interest independently of their position. If the data is generated by some smooth process, this additional structure should be taken into account. We introduce a new class of neural networks that are shift invariant and preserve smoothness of the data: functional neural networks (FNNs). For this, we use methods from functional data analysis (FDA) to extend multi-layer perceptrons and convolutional neural networks to functional data. We propose different model architectures, show that the models outperform a benchmark model from FDA in terms of accuracy and successfully use FNNs to classify electroencephalography (EEG) data.  ( 2 min )
    Complexity Analysis of a Countable-armed Bandit Problem. (arXiv:2301.07243v1 [cs.LG])
    We consider a stochastic multi-armed bandit (MAB) problem motivated by ``large'' action spaces, and endowed with a population of arms containing exactly $K$ arm-types, each characterized by a distinct mean reward. The decision maker is oblivious to the statistical properties of reward distributions as well as the population-level distribution of different arm-types, and is precluded also from observing the type of an arm after play. We study the classical problem of minimizing the expected cumulative regret over a horizon of play $n$, and propose algorithms that achieve a rate-optimal finite-time instance-dependent regret of $\mathcal{O}\left( \log n \right)$. We also show that the instance-independent (minimax) regret is $\tilde{\mathcal{O}}\left( \sqrt{n} \right)$ when $K=2$. While the order of regret and complexity of the problem suggests a great degree of similarity to the classical MAB problem, properties of the performance bounds and salient aspects of algorithm design are quite distinct from the latter, as are the key primitives that determine complexity along with the analysis tools needed to study them.  ( 2 min )
    Improving Federated Learning Personalization via Model Agnostic Meta Learning. (arXiv:1909.12488v2 [cs.LG] UPDATED)
    Federated Learning (FL) refers to learning a high quality global model based on decentralized data storage, without ever copying the raw data. A natural scenario arises with data created on mobile phones by the activity of their users. Given the typical data heterogeneity in such situations, it is natural to ask how can the global model be personalized for every such device, individually. In this work, we point out that the setting of Model Agnostic Meta Learning (MAML), where one optimizes for a fast, gradient-based, few-shot adaptation to a heterogeneous distribution of tasks, has a number of similarities with the objective of personalization for FL. We present FL as a natural source of practical applications for MAML algorithms, and make the following observations. 1) The popular FL algorithm, Federated Averaging, can be interpreted as a meta learning algorithm. 2) Careful fine-tuning can yield a global model with higher accuracy, which is at the same time easier to personalize. However, solely optimizing for the global model accuracy yields a weaker personalization result. 3) A model trained using a standard datacenter optimization method is much harder to personalize, compared to one trained using Federated Averaging, supporting the first claim. These results raise new questions for FL, MAML, and broader ML research.  ( 2 min )
    What relations are reliably embeddable in Euclidean space?. (arXiv:1903.05347v3 [cs.LG] UPDATED)
    We consider the problem of embedding a relation, represented as a directed graph, into Euclidean space. For three types of embeddings motivated by the recent literature on knowledge graphs, we obtain characterizations of which relations they are able to capture, as well as bounds on the minimal dimensionality and precision needed.  ( 2 min )
    Global Contrastive Batch Sampling via Optimization on Sample Permutations. (arXiv:2210.12874v3 [cs.LG] UPDATED)
    Contrastive Learning has recently achieved state-of-the-art performance in a wide range of tasks. Many contrastive learning approaches use mined hard negatives to make batches more informative during training but these approaches are inefficient as they increase epoch length proportional to the number of mined negatives and require frequent updates of nearest neighbor indices or mining from recent batches. In this work, we provide an alternative to hard negative mining, Global Contrastive Batch Sampling (GCBS), an efficient approximation to the batch assignment problem that upper bounds the gap between the global and training losses, $\mathcal{L}^{Global} - \mathcal{L}^{Train}$, in contrastive learning settings. Through experimentation we find GCBS improves state-of-the-art performance in sentence embedding and code-search tasks. Additionally, GCBS is easy to implement as it requires only a few additional lines of code, does not maintain external data structures such as nearest neighbor indices, is more computationally efficient than the most minimal hard negative mining approaches, and makes no changes to the model being trained.  ( 2 min )
    Concentration inequalities for leave-one-out cross validation. (arXiv:2211.02478v2 [math.ST] UPDATED)
    In this article we prove that estimator stability is enough to show that leave-one-out cross validation is a sound procedure, by providing concentration bounds in a general framework. In particular, we provide concentration bounds beyond Lipschitz continuity assumptions on the loss or on the estimator. In order to obtain our results, we rely on random variables with distribution satisfying the logarithmic Sobolev inequality, providing us a relatively rich class of distributions. We illustrate our method by considering several interesting examples, including linear regression, kernel density estimation, and stabilized / truncated estimators such as stabilized kernel regression.  ( 2 min )
    Adversarial Robust Deep Reinforcement Learning Requires Redefining Robustness. (arXiv:2301.07487v1 [cs.LG])
    Learning from raw high dimensional data via interaction with a given environment has been effectively achieved through the utilization of deep neural networks. Yet the observed degradation in policy performance caused by imperceptible worst-case policy dependent translations along high sensitivity directions (i.e. adversarial perturbations) raises concerns on the robustness of deep reinforcement learning policies. In our paper, we show that these high sensitivity directions do not lie only along particular worst-case directions, but rather are more abundant in the deep neural policy landscape and can be found via more natural means in a black-box setting. Furthermore, we show that vanilla training techniques intriguingly result in learning more robust policies compared to the policies learnt via the state-of-the-art adversarial training techniques. We believe our work lays out intriguing properties of the deep reinforcement learning policy manifold and our results can help to build robust and generalizable deep reinforcement learning policies.  ( 2 min )
    PENDANTSS: PEnalized Norm-ratios Disentangling Additive Noise, Trend and Sparse Spikes. (arXiv:2301.01514v1 [eess.SP] CROSS LISTED)
    Denoising, detrending, deconvolution: usual restoration tasks, traditionally decoupled. Coupled formulations entail complex ill-posed inverse problems. We propose PENDANTSS for joint trend removal and blind deconvolution of sparse peak-like signals. It blends a parsimonious prior with the hypothesis that smooth trend and noise can somewhat be separated by low-pass filtering. We combine the generalized quasi-norm ratio SOOT/SPOQ sparse penalties $\ell_p/\ell_q$ with the BEADS ternary assisted source separation algorithm. This results in a both convergent and efficient tool, with a novel Trust-Region block alternating variable metric forward-backward approach. It outperforms comparable methods, when applied to typically peaked analytical chemistry signals. Reproducible code is provided.  ( 2 min )
    Using Topological Data Analysis to classify Encrypted Bits. (arXiv:2301.07393v1 [cs.CR])
    We present a way to apply topological data analysis for classifying encrypted bits into distinct classes. Persistent homology is applied to generate topological features of a point cloud obtained from sets of encryptions. We see that this machine learning pipeline is able to classify our data successfully where classical models of machine learning fail to perform the task. We also see that this pipeline works as a dimensionality reduction method making this approach to classify encrypted data a realistic method to classify the given encryptioned bits.  ( 2 min )
    Physics-informed Information Field Theory for Modeling Physical Systems with Uncertainty Quantification. (arXiv:2301.07609v1 [stat.ML])
    Data-driven approaches coupled with physical knowledge are powerful techniques to model systems. The goal of such models is to efficiently solve for the underlying field by combining measurements with known physical laws. As many systems contain unknown elements, such as missing parameters, noisy data, or incomplete physical laws, this is widely approached as an uncertainty quantification problem. The common techniques to handle all the variables typically depend on the numerical scheme used to approximate the posterior, and it is desirable to have a method which is independent of any such discretization. Information field theory (IFT) provides the tools necessary to perform statistics over fields that are not necessarily Gaussian. We extend IFT to physics-informed IFT (PIFT) by encoding the functional priors with information about the physical laws which describe the field. The posteriors derived from this PIFT remain independent of any numerical scheme and can capture multiple modes, allowing for the solution of problems which are ill-posed. We demonstrate our approach through an analytical example involving the Klein-Gordon equation. We then develop a variant of stochastic gradient Langevin dynamics to draw samples from the joint posterior over the field and model parameters. We apply our method to numerical examples with various degrees of model-form error and to inverse problems involving nonlinear differential equations. As an addendum, the method is equipped with a metric which allows the posterior to automatically quantify model-form uncertainty. Because of this, our numerical experiments show that the method remains robust to even an incorrect representation of the physics given sufficient data. We numerically demonstrate that the method correctly identifies when the physics cannot be trusted, in which case it automatically treats learning the field as a regression problem.  ( 2 min )
    Dual-sPLS: a family of Dual Sparse Partial Least Squares regressions for feature selection and prediction with tunable sparsity; evaluation on simulated and near-infrared (NIR) data. (arXiv:2301.07206v1 [stat.ML])
    Relating a set of variables X to a response y is crucial in chemometrics. A quantitative prediction objective can be enriched by qualitative data interpretation, for instance by locating the most influential features. When high-dimensional problems arise, dimension reduction techniques can be used. Most notable are projections (e.g. Partial Least Squares or PLS ) or variable selections (e.g. lasso). Sparse partial least squares combine both strategies, by blending variable selection into PLS. The variant presented in this paper, Dual-sPLS, generalizes the classical PLS1 algorithm. It provides balance between accurate prediction and efficient interpretation. It is based on penalizations inspired by classical regression methods (lasso, group lasso, least squares, ridge) and uses the dual norm notion. The resulting sparsity is enforced by an intuitive shrinking ratio parameter. Dual-sPLS favorably compares to similar regression methods, on simulated and real chemical data. Code is provided as an open-source package in R: \url{https://CRAN.R-project.org/package=dual.spls}.  ( 2 min )
    Discrete Latent Structure in Neural Networks. (arXiv:2301.07473v1 [cs.LG])
    Many types of data from fields including natural language processing, computer vision, and bioinformatics, are well represented by discrete, compositional structures such as trees, sequences, or matchings. Latent structure models are a powerful tool for learning to extract such representations, offering a way to incorporate structural bias, discover insight about the data, and interpret decisions. However, effective training is challenging, as neural networks are typically designed for continuous computation. This text explores three broad strategies for learning with discrete latent structure: continuous relaxation, surrogate gradients, and probabilistic estimation. Our presentation relies on consistent notations for a wide range of models. As such, we reveal many new connections between latent structure learning strategies, showing how most consist of the same small set of fundamental building blocks, but use them differently, leading to substantially different applicability and properties.  ( 2 min )
    Sample Complexity of Adversarially Robust Linear Classification on Separated Data. (arXiv:2012.10794v3 [cs.LG] UPDATED)
    We consider the sample complexity of learning with adversarial robustness. Most prior theoretical results for this problem have considered a setting where different classes in the data are close together or overlapping. Motivated by some real applications, we consider, in contrast, the well-separated case where there exists a classifier with perfect accuracy and robustness, and show that the sample complexity narrates an entirely different story. Specifically, for linear classifiers, we show a large class of well-separated distributions where the expected robust loss of any algorithm is at least $\Omega(\frac{d}{n})$, whereas the max margin algorithm has expected standard loss $O(\frac{1}{n})$. This shows a gap in the standard and robust losses that cannot be obtained via prior techniques. Additionally, we present an algorithm that, given an instance where the robustness radius is much smaller than the gap between the classes, gives a solution with expected robust loss is $O(\frac{1}{n})$. This shows that for very well-separated data, convergence rates of $O(\frac{1}{n})$ are achievable, which is not the case otherwise. Our results apply to robustness measured in any $\ell_p$ norm with $p > 1$ (including $p = \infty$).  ( 2 min )
    Electronic excited states in deep variational Monte Carlo. (arXiv:2203.09472v3 [physics.chem-ph] UPDATED)
    Obtaining accurate ground and low-lying excited states of electronic systems is crucial in a multitude of important applications. One ab initio method for solving the Schr\"odinger equation that scales favorably for large systems is variational quantum Monte Carlo (QMC). The recently introduced deep QMC approach uses ansatzes represented by deep neural networks and generates nearly exact ground-state solutions for molecules containing up to a few dozen electrons, with the potential to scale to much larger systems where other highly accurate methods are not feasible. In this paper, we extend one such ansatz (PauliNet) to compute electronic excited states. We demonstrate our method on various small atoms and molecules and consistently achieve high accuracy for low-lying states. To highlight the method's potential, we compute the first excited state of the much larger benzene molecule, as well as the conical intersection of ethylene, with PauliNet matching results of more expensive high-level methods.  ( 2 min )
    LIMEADE: From AI Explanations to Advice Taking. (arXiv:2003.04315v5 [cs.IR] UPDATED)
    Research in human-centered AI has shown the benefits of systems that can explain their predictions. Methods that allow an AI to take advice from humans in response to explanations are similarly useful. While both capabilities are well-developed for transparent learning models (e.g., linear models and GA$^2$Ms), and recent techniques (e.g., LIME and SHAP) can generate explanations for opaque models, little attention has been given to advice methods for opaque models. This paper introduces LIMEADE, the first general framework that translates both positive and negative advice (expressed using high-level vocabulary such as that employed by post-hoc explanations) into an update to an arbitrary, underlying opaque model. We demonstrate the generality of our approach with case studies on seventy real-world models across two broad domains: image classification and text recommendation. We show our method improves accuracy compared to a rigorous baseline on the image classification domains. For the text modality, we apply our framework to a neural recommender system for scientific papers on a public website; our user study shows that our framework leads to significantly higher perceived user control, trust, and satisfaction.  ( 2 min )
    Landscape Complexity for the Empirical Risk of Generalized Linear Models. (arXiv:1912.02143v5 [stat.ML] UPDATED)
    We present a method to obtain the average and the typical value of the number of critical points of the empirical risk landscape for generalized linear estimation problems and variants. This represents a substantial extension of previous applications of the Kac-Rice method since it allows to analyze the critical points of high dimensional non-Gaussian random functions. Under a technical hypothesis, we obtain a rigorous explicit variational formula for the annealed complexity, which is the logarithm of the average number of critical points at fixed value of the empirical risk. This result is simplified, and extended, using the non-rigorous Kac-Rice replicated method from theoretical physics. In this way we find an explicit variational formula for the quenched complexity, which is generally different from its annealed counterpart, and allows to obtain the number of critical points for typical instances up to exponential accuracy.  ( 2 min )
    Strong inductive biases provably prevent harmless interpolation. (arXiv:2301.07605v1 [stat.ML])
    Classical wisdom suggests that estimators should avoid fitting noise to achieve good generalization. In contrast, modern overparameterized models can yield small test error despite interpolating noise -- a phenomenon often called "benign overfitting" or "harmless interpolation". This paper argues that the degree to which interpolation is harmless hinges upon the strength of an estimator's inductive bias, i.e., how heavily the estimator favors solutions with a certain structure: while strong inductive biases prevent harmless interpolation, weak inductive biases can even require fitting noise to generalize well. Our main theoretical result establishes tight non-asymptotic bounds for high-dimensional kernel regression that reflect this phenomenon for convolutional kernels, where the filter size regulates the strength of the inductive bias. We further provide empirical evidence of the same behavior for deep neural networks with varying filter sizes and rotational invariance.  ( 2 min )
    Non-IID Quantum Federated Learning with One-shot Communication Complexity. (arXiv:2209.00768v2 [quant-ph] UPDATED)
    Federated learning refers to the task of machine learning based on decentralized data from multiple clients with secured data privacy. Recent studies show that quantum algorithms can be exploited to boost its performance. However, when the clients' data are not independent and identically distributed (IID), the performance of conventional federated algorithms is known to deteriorate. In this work, we explore the non-IID issue in quantum federated learning with both theoretical and numerical analysis. We further prove that a global quantum channel can be exactly decomposed into local channels trained by each client with the help of local density estimators. This observation leads to a general framework for quantum federated learning on non-IID data with one-shot communication complexity. Numerical simulations show that the proposed algorithm outperforms the conventional ones significantly under non-IID settings.  ( 2 min )
    An Analysis of Loss Functions for Binary Classification and Regression. (arXiv:2301.07638v1 [stat.ML])
    This paper explores connections between margin-based loss functions and consistency in binary classification and regression applications. It is shown that a large class of margin-based loss functions for binary classification/regression result in estimating scores equivalent to log-likelihood scores weighted by an even function. A simple characterization for conformable (consistent) loss functions is given, which allows for straightforward comparison of different losses, including exponential loss, logistic loss, and others. The characterization is used to construct a new Huber-type loss function for the logistic model. A simple relation between the margin and standardized logistic regression residuals is derived, demonstrating that all margin-based loss can be viewed as loss functions of squared standardized logistic regression residuals. The relation provides new, straightforward interpretations for exponential and logistic loss, and aids in understanding why exponential loss is sensitive to outliers. In particular, it is shown that minimizing empirical exponential loss is equivalent to minimizing the sum of squared standardized logistic regression residuals. The relation also provides new insight into the AdaBoost algorithm.  ( 2 min )
    Large Deviations for Classification Performance Analysis of Machine Learning Systems. (arXiv:2301.07104v1 [cs.LG])
    We study the performance of machine learning binary classification techniques in terms of error probabilities. The statistical test is based on the Data-Driven Decision Function (D3F), learned in the training phase, i.e., what is thresholded before the final binary decision is made. Based on large deviations theory, we show that under appropriate conditions the classification error probabilities vanish exponentially, as $\sim \exp\left(-n\,I + o(n) \right)$, where $I$ is the error rate and $n$ is the number of observations available for testing. We also propose two different approximations for the error probability curves, one based on a refined asymptotic formula (often referred to as exact asymptotics), and another one based on the central limit theorem. The theoretical findings are finally tested using the popular MNIST dataset.  ( 2 min )

  • Open

    [D] ICLR 2023 results.
    Hi, Making a post for anything to be discussed related to ICLR 2023 results ​ One question I had: Is the exact time of result announcement fixed? submitted by /u/East-Beginning9987 [link] [comments]  ( 41 min )
    [Discussion] I'm Getting 50FPS With 4 Billion Parameters, Is That Good? - Compute Shader Implementation
    So I like to DIY a lot. I coded a neural network to run parallel in a compute shader with TanH activation, and the performance was much better than I expected. I Tested with a 3090 with many layers of 20,000 neurons until I reached 4 billion total parameters which ran around 50FPS when looped every frame. Is this above average performance for a GPU implementations? I haven't really tested out any other GPU implementations, so I was wondering if anyone here knows. submitted by /u/TheRPGGamerMan [link] [comments]  ( 42 min )
    I'm working on a project and need a open source chatbot that I can run locally and train to talk like a specific character, does anyone know one? [P]
    Title, I am trying to get a chatbot to act like Megumin, similar to Character AI, but open source and can be run on a local machine. Thank you! submitted by /u/otakuhacker123 [link] [comments]  ( 42 min )
    [D][P] Best Speech-to-Text Model for domain-specific data - Open Source vs. Paid Services
    I originally posted this here on r/learnmachinelearning but reposting here as it may be a more appropriate subreddit and/or may have a different perspective. I want a tool to programmatically generate transcripts from sermons. I have access to hundreds of sermon transcripts (and 100x more very similar in domain data) but less than a 40 transcripts with audio (~30 hours). I want the lowest WER (Word Error Rate) possible and can budge 100 hours for this project in 2023. Train my own acoustic model with the best open source offering OpenAI's Whisper seems to be the best available today? How much supervised data (e.g. hours of sermons with perfect transcripts) would I need to develop a model that would be more accurate than Google/AWS for my specific domain? Can I take a model already trained and "tune" it by augmenting the data I have? Use the best cloud speech-to-text API that I can provide in-domain data to to tune it AWS Transcribe and Google Speech to Text seem to be big players I've gone with AWS Transcribe since it can be tuned more easily with custom domain data (just upload text files) than Google's (which requires building phrase dictionaries with weights). Is there anything out there that's better for my use case? submitted by /u/Knecht_Christi [link] [comments]  ( 43 min )
    [D] Pre-trained Models for Domain-Specific (i.e. Stylistic) Feature Extraction
    Most or all of the style transfer models are used to extract the domain-independent (robust) feature from an artpiece so as to apply it to different styles. But I need the opposite: I need a pretrained model that can extract the domain-specific (i.e. stylistic) feature from an artpiece. Are there any publicly available ones I can use? It doesn't matter whether it's a Github repository, a Huggingface API, or something else. Thank you! submitted by /u/No_Zookeepergame8794 [link] [comments]  ( 42 min )
    [P] Labeling tools are great, but what about quality checks?
    Modern datasets contain hundreds of thousands to millions of labels that must be kept accurate. In practice, some errors in the dataset average out and can be ignored – systematic biases transfer to the model. After quick initial wins in areas where abundant data is readily available, deep learning needs to become more data efficient to help solve difficult business problems. MLfix is a new open-source tool that combines novel unsupervised machine-learning pipelines with a new user interface concept that, together, help annotators and machine-learning engineers identify and filter out label errors. https://www.collabora.com/news-and-blog/blog/2023/01/17/labeling-tools-are-great-but-what-about-quality-checks/ submitted by /u/mfilion [link] [comments]  ( 42 min )
    [D] is it time to investigate retrieval language models?
    With ChatGPT going mainstream and the general push to make products out of LM, a problem remain about the cost of running such models. To me, it seems counterproductive to put both language modelling and knowledge inside the model weights. Is it time to shift to retrieval LM like Retro to keep the cost down while offering the same products? It would possibly allow Google or others to offer a free assistant service, using embeddings similarity search to retrieve results from the Internet so the model itself could possibly even run on edge devices? What are your thoughts about that subject? submitted by /u/hapliniste [link] [comments]  ( 46 min )
    [P] Code super clean multi-modal PyTorch models and easily serve them through FastAPI, using DocArray
    Hi all! I'd like to share an open source project that I am currently working on together with a few colleagues: DocArray! If you've ever trained models that deal with different data types (images, text, video, audio, ...) then you know how much of a hassle it can be to keep track of all of your tensors, what shapes they have, and what data they are meant to represent. That's what we're trying to change with DocArray, a Python library for representing, sending, and storing multi-modal data! The core idea of DocArray is that you define Documents that represent your data. For example, one Document could hold the file path to an image, its image tensor, and and image embedding that your model creates. A different Document could do the same thing for some Text, and a third Document might co…  ( 44 min )
    [P] Tired of generating synthetic corgis❓🐶 Check out Synthcity, a framework for synthetic tabular data
    🌟 Synthcity isa library for generating and benchmarking synthetic tabular data. https://github.com/vanderschaarlab/synthcity ​ 🚀 Synthcity includes a wide range of algorithms for various use cases, such as: - tabular data(CTGAN, TVAE, Bayesian Networks etc) - survival analysis(SurvivalGAN etc). - time series(Fourier Flows, TimeGAN, etc.). - privacy-focused(DP-GAN, PATEGAN, AdsGAN, DECAF). - domain adaptation(RadialGAN). ​ 🔍 Synthcity supports benchmarking multiple algorithms, testing data quality, downstream performance, statistical fidelity, and privacy metrics. ​ 🌀 Give it a try: - Library: https://github.com/vanderschaarlab/synthcity - Tutorial: https://colab.research.google.com/drive/1Vr2PJswgfFYBkJCm3hhVkuH-9dXnHeYV?usp=sharing - Docs: https://synthcity.readthedocs.io/ submitted by /u/ManagementBig2995 [link] [comments]  ( 42 min )
    [P] Need some recs on an NLP project
    Hello, for my job, I have to extract job responsibilities from job ads. I'm thinking of approaching it as a span extraction problem. Where I'm gonna label the job responsibility span manually for around 1000 samples. And use supervised learning. Is there any better way to approach this problem? Is there any pretrained model I can use to fine tune? Any suggestion will be appreciated. Thanks! submitted by /u/Salekeen01 [link] [comments]  ( 43 min )
    [R] Human-Timescale Adaptation in an Open-Ended Task Space - (AdA) - DeepMind 2023 - Can adapt to open-ended novel embodied 3D problems as quickly as humans!
    Paper: https://arxiv.org/abs/2301.07608 Youtube: https://www.youtube.com/watch?v=U93bUQ1roiw Please watch the Video the explanations are better than me giving you 3-5 Pictures! Abstract: Foundation models have shown impressive adaptation and scalability in supervised and self-supervised learning problems, but so far these successes have not fully translated to reinforcement learning (RL). In this work, we demonstrate that training an RL agent at scale leads to a general in-context learning algorithm that can adapt to open-ended novel embodied 3D problems as quickly as humans. In a vast space of held-out environment dynamics, our adaptive agent (AdA) displays on-the-fly hypothesis-driven exploration, efficient exploitation of acquired knowledge, and can successfully be prompted with first-person demonstrations. Adaptation emerges from three ingredients: (1) meta-reinforcement learning across a vast, smooth and diverse task distribution, (2) a policy parameterised as a large-scale attention-based memory architecture, and (3) an effective automated curriculum that prioritises tasks at the frontier of an agent's capabilities. We demonstrate characteristic scaling laws with respect to network size, memory length, and richness of the training task distribution. We believe our results lay the foundation for increasingly general and adaptive RL agents that perform well across ever-larger open-ended domains. https://preview.redd.it/is3pyl1p70da1.jpg?width=1424&format=pjpg&auto=webp&s=8d102af4202711be7e01619577f109b6598d6ed5 submitted by /u/Singularian2501 [link] [comments]  ( 44 min )
    [D] Question about using diffusion to denoise images
    Hi all, I am trying to see if I can use DDPM (Denoising Diffusion Probabilistic Model) to denoise images using a supervised learning approach. However, I've learned that DDPM is only for unconditional image generation. Has anyone had experience using conditional DDPM and could help me out with some conceptual questions? Here's what I'm trying to understand: Say I have a pair of noisy and clean ground truth images. Should I take my clean image and gradually corrupt it by adding gaussian noise in the forward diffusion (FD) process? Could I get the network to learn the reverse diffusion process by giving it the noisy input, the FD noisy image, and positional embeddings? I was planning on concatenating the noisy input with the FD noisy image. During training, the network learns to predict noise at t-1 given the image at t conditioned on the input noisy source image. Here is an image showing you what I mean. Any thoughts or suggestions would be greatly appreciated. DDPM for image denoising submitted by /u/CurrentlyJoblessFML [link] [comments]  ( 44 min )
    [D] What is the name of this NLP technique?
    Lets say I have a dataset of real estate listings. I have a column of text that describes the listing, and another column that shows the number of rooms for example. In most of the cases, the number of rooms is shown in both columns, in the description text and also in the dedicated column. But for some observations, the number of rooms is in the description text but not in the column "number of rooms". So I have missing data. I could try to fill the missing data with by applying regex in the description text, but the number of possibilities seems to big. Is there a machine learning technique in NLP that allows me to do that, since it most of the observations the data is present in both column, so is "naturally labelled"? If there is, what is the name of these techniques? I would like to search about it but I don't know the proper keywords to google. submitted by /u/Kebet-Mendez [link] [comments]  ( 43 min )
    [D] Inner workings of the chatgpt memory
    All the examples from langchain and on huggingface create memory by pasting the entire history in every prompt. This seems to violate the max input prompt length pretty quickly. And it’s expensive. Does chatgpt use something revolutionary? It forgets everything when you create a new session so it ‘feels’ it’s using the convo as memory as well. But then the question; how do they get past prompt limits? Chunking doesn’t help as it still doesn’t get context in that case between prompts. Maybe they ask the same question with different chunks many times and then ask for a final result? Apologies if this was answered somewhere, I cannot find it at all and all examples use the same kind of history memory. submitted by /u/terserterseness [link] [comments]  ( 45 min )
    [D] very short video generation
    Hi, my indie game devs asked me if I could build a model that generates cool movements for their characters. 1) I wanted to start by generating characters and scene. Should I go for stable diffusion or for a GAN ? I do need a prompt 2) do you know any model that can generate short video clips (2 seconds) and that could potentially generate character movement ? Thank you so much ! submitted by /u/Frizzoux [link] [comments]  ( 42 min )
    [D] ML Researchers/Engineers in Industry: Why don't companies use open source models more often?
    In my experience at big tech, I've never seen any company use open-source ML models in production. Why is this the case? Curious, because there seems to be some insanely cool research going on these days. On the other hand, if you have seen this used, what kind of repos have you guys seen? submitted by /u/tennismlandguitar [link] [comments]  ( 46 min )
    [Discussion] Storing hundreds of ML models - what do you use?
    I am currently using the Google Cloud Model Registry and I want to learn what you use for archiving your machine learning models. What are the other options for developers who have to store hundreds of models? https://cloud.google.com/blog/products/ai-machine-learning/vertex-ai-model-registry submitted by /u/May-is-spring [link] [comments]  ( 42 min )
    [D] Best LLM for Question/Answering with personality?
    Hi, I am looking at fine tuning an open source LLM to answer questions as a specific character from chat on discord. I'm trying to decide which one to test between KoboldAI, GPTJ, Neo, Flan-T5, etc. Has anyone tries these LLMs and knows which would be best for this use case? The use case is to have a character that answers questions from discord chat in the form a specific character, with a personality and can be mean for example, very similar to https://beta.character.ai/ Does anyone want to guess what they use based on their experience or the closest LLM to replicate it? if this helps sometimes the model will go off on tangential stories which makes me think maybe it was trained on a novel or story dataset originally. submitted by /u/TernaryJimbo [link] [comments]  ( 42 min )
  • Open

    YouTube Video scripted entirely by ChatGPT???
    submitted by /u/McFIyyy [link] [comments]  ( 40 min )
    Inventory Management?
    I read a study that showed AI is being deployed into inventory management at a whopping 44% of total usage, though I can't quantify this, I am interested to know how to combine AI with my ERP, static data, a database etc. Does anyone here have experience with AI inventory management for MoQ, forecasting etc that they recommend? I want to do some homework submitted by /u/smudgepost [link] [comments]  ( 40 min )
    WhatsApp ChatGPT
    Hello i wanna ask you guys about this AI whatapp tool https://chattycat.ju.mp/ is it safe i saw it in a tweet and i was wondering if its the same as chat gpt and is it safe to give your name number and email " original tweet " https://twitter.com/TansuYegen/status/1616138232894492672 ​ WhatsApp gpt submitted by /u/Jnxe [link] [comments]  ( 40 min )
    Two guys in London working in AI looking for volunteers to join our team in educating the public on AI
    We’re 2 Brits who work in AI. We believe AI is likely to have a huge and mostly positive impact on society but that not many people realise this or understand how it will impact everyday life. There is a lack of places online right now clearly explaining the changes AI will bring, i.e., how will AI change the experience of shopping in stores in the next 10 years or how will AI change video games in the next 10 years. We are somewhat well positioned to collate the current views on likely future changes across most areas and are in the process of starting a website and perhaps video channel which will cover how AI is likely to impact people over the next 10 years in different areas of life (movies, sports, bars, banking, schools, hospitals etc). We are looking for people to help us research, write and make videos on this cause – which we think is important to help ensure that voters don’t misunderstand AI. Alex – researches, writes, and records the audio Seb - does the video and audio editing We thought we’d put the word out and ask if anyone else would like to volunteer to help create content too. No special skills needed. Getting involved would be as easy as PMing me, hearing about how we’ve done things so far and then saying what you might be interested in helping with. Maybe thinking about ideas for topics or getting involved in research and/or article writing. We are UTC-0 but open to all. submitted by /u/TheOptimisticRogue [link] [comments]  ( 41 min )
    The Next Step (I believe)
    We have already seen the capabilities of AI sourcing from the Internet to create, learn, and exceed our expectations. I want to be quick with my description so I don't loose anybody. The next step I believe will be AI catered to us individually. What I mean is an app that only cares for one person, that being you. It will comb over (with permission hopefully) everything you ever put online. Also track writting, Grammer, interests (internet of things information), about you. Then if you allow it, can go over your medical or financial history, even psychology. It will feel like a chatbot but be a middleman towards seeking healthcare, mental care, and may other things including niché interests and possible career routes. I don't propose this to invite any fear or anxiety of a sci-fi narrative, but to objectively observe the trends with AI and my belief where it is going. submitted by /u/DropDeddBlue [link] [comments]  ( 41 min )
    2 months to make ai video last summer: “The technology was evolving so fast that I worried my video would feel outdated by the time it would be ready.”
    submitted by /u/defensiveFruit [link] [comments]  ( 40 min )
    Announcing a major update to Perplexity Ask: the world’s first conversational search engine! Now, you can read answers with up-to-date sources and ask follow-up questions to dig deeper. In other words, you can chat with your search engine!
    submitted by /u/rafs2006 [link] [comments]  ( 40 min )
    is the data we have today enough to create AGI?
    let's say hypothetically we were only able to work with digital data we have collected up until today to try and create AGI, would it be possible? submitted by /u/Science_is_Greatness [link] [comments]  ( 42 min )
    Labeling tools are great, but what about quality checks?
    Modern datasets contain hundreds of thousands to millions of labels that must be kept accurate. In practice, some errors in the dataset average out and can be ignored – systematic biases transfer to the model. After quick initial wins in areas where abundant data is readily available, deep learning needs to become more data efficient to help solve difficult business problems. MLfix is a new open-source tool that combines novel unsupervised machine-learning pipelines with a new user interface concept that, together, help annotators and machine-learning engineers identify and filter out label errors. https://www.collabora.com/news-and-blog/blog/2023/01/17/labeling-tools-are-great-but-what-about-quality-checks/ submitted by /u/mfilion [link] [comments]  ( 40 min )
    Google Research And DeepMind Create AI Medical Chatbot That Can Generate Safe And Helpful Answers!
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    Software to read "Once upon a time in Hollywood" book in character voices from film.
    I want an audiobook version of the book "Once upon a time in Hollywood" by Tarantino with the actors voices from the movie, using the voices from the movie as sample. How do I do that? submitted by /u/Art3mis_eros [link] [comments]  ( 40 min )
    Professor Initiates Integration of ChatGPT in Classroom
    submitted by /u/lambolifeofficial [link] [comments]  ( 40 min )
    Neural Network 'Hallucinating' While Training On Dog Images
    submitted by /u/TheRPGGamerMan [link] [comments]  ( 41 min )
    Wi-Fi Could Help Identify When You’re Struggling to Breathe
    submitted by /u/goronmask [link] [comments]  ( 40 min )
    Summarizing Text using In-database NLP through the Integration of Hugging Face with MindsDB
    submitted by /u/Klutzy_Accountant113 [link] [comments]  ( 40 min )
    Farmers Spend $5Billion a Year on Antibiotics For Their Animals using a Blanket Approach. Medicate all to Prevent Infection. AI Models are Being Used to Identify & Medicate Only Animals That are Actually Sick
    submitted by /u/HODLTID [link] [comments]  ( 40 min )
    I got frustrated with the time and effort required to code and maintain custom web scrapers, so I built an LLM-powered tool that can comprehend any website structure and extract the desired data in the preferred format.
    submitted by /u/madredditscientist [link] [comments]  ( 41 min )
    Generative AI Technology Is Discovering Completely New Drugs
    submitted by /u/bukowski3000 [link] [comments]  ( 40 min )
    Join us this Friday 6 pm EST for a fascinating discussion about the societal impact of large language models (LLMs) like ChatGPT
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 40 min )
    Exclusive: The $2 Per Hour Workers Who Made ChatGPT Safer
    submitted by /u/ohmsalad [link] [comments]  ( 40 min )
    Hey guys! I made some of the cartoon characters to look like villains. Can you guess which cartoon they are from?
    submitted by /u/_aimnftri [link] [comments]  ( 40 min )
    Made with AI
    submitted by /u/NorthTs [link] [comments]  ( 40 min )
  • Open

    Probability problem with Pratt prime proofs
    In the process of creating a Pratt certificate to prove that a number n is prime, you have to find a number a that seems kinda arbitrary. As we discussed here, a number n is prime if there exists a number a such that an-1 = 1 mod n and a(n-1)/p ≠ 1 mod n […] Probability problem with Pratt prime proofs first appeared on John D. Cook.  ( 5 min )
    Factoring b^n + 1
    The previous post illustrated a technique for finding factors of number of the form bn – 1. This post will look at an analogous, though slightly less general, technique for numbers of the form bn + 1. There is a theorem that says that if m divides n then bm + 1 divides bn + […] Factoring b^n + 1 first appeared on John D. Cook.  ( 5 min )
    Factoring b^n – 1
    Suppose you want to factor a number of the form bn – 1. There is a theorem that says that if m divides n then bm – 1 divides bn – 1. Let’s use this theorem to try to factor J = 22023 – 1, a 609-digit number. Factoring such a large number would be more difficult if it didn’t have […] Factoring b^n – 1 first appeared on John D. Cook.  ( 6 min )
  • Open

    AI’s Leg Up: Startup Accelerates Robotics Simulation for $8 Trillion Food Market
    Robots are finally getting a grip. Developers have been striving to close the gap on robotic gripping for the past several years, pursuing applications for multibillion-dollar industries. Securely gripping and transferring fast-moving items on conveyor belts holds vast promise for businesses. Soft Robotics, a Bedford, Mass., startup, is harnessing NVIDIA Isaac Sim to help close Read article >  ( 6 min )
    The Ultimate Upgrade: GeForce RTX 4080 SuperPOD Rollout Begins Today
    The Ultimate upgrade begins today: GeForce NOW RTX 4080 SuperPODs are now rolling out, bringing a new level of high-performance gaming to the cloud. Ultimate members will start to see RTX 4080 performance in their region soon, and experience titles like  Warhammer 40,000: Darktide, Cyberpunk 2077, The Witcher 3: Wild Hunt and more at ultimate Read article >  ( 5 min )
  • Open

    Human-Timescale Adaptation in an Open-Ended Task Space - (AdA) - DeepMind 2023 - Can adapt to open-ended novel embodied 3D problems as quickly as humans!
    Paper: https://arxiv.org/abs/2301.07608 Youtube: https://www.youtube.com/watch?v=U93bUQ1roiw Please watch the Video the explanations are better than me giving you 3-5 Pictures! Abstract: Foundation models have shown impressive adaptation and scalability in supervised and self-supervised learning problems, but so far these successes have not fully translated to reinforcement learning (RL). In this work, we demonstrate that training an RL agent at scale leads to a general in-context learning algorithm that can adapt to open-ended novel embodied 3D problems as quickly as humans. In a vast space of held-out environment dynamics, our adaptive agent (AdA) displays on-the-fly hypothesis-driven exploration, efficient exploitation of acquired knowledge, and can successfully be prompted with first-person demonstrations. Adaptation emerges from three ingredients: (1) meta-reinforcement learning across a vast, smooth and diverse task distribution, (2) a policy parameterised as a large-scale attention-based memory architecture, and (3) an effective automated curriculum that prioritises tasks at the frontier of an agent's capabilities. We demonstrate characteristic scaling laws with respect to network size, memory length, and richness of the training task distribution. We believe our results lay the foundation for increasingly general and adaptive RL agents that perform well across ever-larger open-ended domains. https://preview.redd.it/rcta0vvt80da1.jpg?width=1424&format=pjpg&auto=webp&s=da7cc5745a21969b1687b7cf2a8c590dcac72ae0 submitted by /u/Singularian2501 [link] [comments]  ( 41 min )
    On the legal status of downloading and using ATARI 2600 ROMs
    Hi there, I have searched online quite broadly, and could not find an answer anywhere on the following questions: Is it legal to download ATARI ROMs, e.g., the ones in https://github.com/Farama-Foundation/AutoROM? Is it legal to use those ROMs? (Imagine I acquired them in a different way than downloading, like physical shipping on USB drive) If both are illegal, what's the legal situation around using them for reinforcement learning research? What I know from reading: Downloading ROMs is a gray area. It is supposedly illegal, but if the copyright holder knows about a use of their ROMs and doesn't do anything, there exist and "implied license" allowing their use. Source: here. Providing a means to download these ROMs is not illegal, it is allowed. Source: here. ​ Does anyone have more info or legal experience with this? ​ Thanks submitted by /u/Conscious_Heron_9133 [link] [comments]  ( 42 min )
    trying to make a single actuated link stay upright with dqn. The maximum score you can get from the game is 20,000, and the highest score I am getting is 200.
    submitted by /u/blackgentrifier [link] [comments]  ( 40 min )
  • Open

    Equivariant Networks for Crystal Structures. (arXiv:2211.15420v2 [cond-mat.mtrl-sci] UPDATED)
    Supervised learning with deep models has tremendous potential for applications in materials science. Recently, graph neural networks have been used in this context, drawing direct inspiration from models for molecules. However, materials are typically much more structured than molecules, which is a feature that these models do not leverage. In this work, we introduce a class of models that are equivariant with respect to crystalline symmetry groups. We do this by defining a generalization of the message passing operations that can be used with more general permutation groups, or that can alternatively be seen as defining an expressive convolution operation on the crystal graph. Empirically, these models achieve competitive results with state-of-the-art on property prediction tasks.
    Multi-Task Imitation Learning for Linear Dynamical Systems. (arXiv:2212.00186v2 [cs.LG] UPDATED)
    We study representation learning for efficient imitation learning over linear systems. In particular, we consider a setting where learning is split into two phases: (a) a pre-training step where a shared $k$-dimensional representation is learned from $H$ source policies, and (b) a target policy fine-tuning step where the learned representation is used to parameterize the policy class. We find that the imitation gap over trajectories generated by the learned target policy is bounded by $\tilde{O}\left( \frac{k n_x}{HN_{\mathrm{shared}}} + \frac{k n_u}{N_{\mathrm{target}}}\right)$, where $n_x > k$ is the state dimension, $n_u$ is the input dimension, $N_{\mathrm{shared}}$ denotes the total amount of data collected for each policy during representation learning, and $N_{\mathrm{target}}$ is the amount of target task data. This result formalizes the intuition that aggregating data across related tasks to learn a representation can significantly improve the sample efficiency of learning a target task. The trends suggested by this bound are corroborated in simulation.
    FedALA: Adaptive Local Aggregation for Personalized Federated Learning. (arXiv:2212.01197v2 [cs.LG] UPDATED)
    A key challenge in federated learning (FL) is the statistical heterogeneity that impairs the generalization of the global model on each client. To address this, we propose a method Federated learning with Adaptive Local Aggregation (FedALA) by capturing the desired information in the global model for client models in personalized FL. The key component of FedALA is an Adaptive Local Aggregation (ALA) module, which can adaptively aggregate the downloaded global model and local model towards the local objective on each client to initialize the local model before training in each iteration. To evaluate the effectiveness of FedALA, we conduct extensive experiments with five benchmark datasets in computer vision and natural language processing domains. FedALA outperforms eleven state-of-the-art baselines by up to 3.27% in test accuracy. Furthermore, we also apply ALA module to other federated learning methods and achieve up to 24.19% improvement in test accuracy.
    FeSAC: Federated Learning-Based Soft Actor-Critic Traffic Offloading in Space-Air-Ground Integrated Network. (arXiv:2212.02075v2 [cs.NI] UPDATED)
    With the increase of intelligent devices leading to increasing demand for traffic, traffic offloading has become a challenging problem. The space-air-ground integrated network (SAGIN) is a superior network architecture to solve this problem. The existing research on SAGIN traffic offloading only considers the single-layer satellite network in the space network. To further expand the resource pool of traffic offloading in SAGIN, we extend the single-layer satellite network into a double-layer satellite network composed of low-orbit satellites (LEO) and high-orbit satellites (GEO). And re-model a four-layer SAGIN architecture consisting of the ground network, the air network, LEO and GEO. Furthermore, we propose a novel Federated Soft Actor-Critic (FeSAC) traffic offloading method with positive environmental exploration to accommodate this dynamic and complex four-layer SAGIN architecture. The FeSAC method uses federated learning to train traffic offloading nodes and then aggregate the training results to obtain the best traffic offloading strategy. The simulation results show that under the four-layer SAGIN, our proposed method can better adapt to the network environment changes by nodes mobility and is better than the existing traffic offloading methods in throughput, packet loss, and transmission delay.
    Beyond ADMM: A Unified Client-variance-reduced Adaptive Federated Learning Framework. (arXiv:2212.01519v2 [cs.LG] UPDATED)
    As a novel distributed learning paradigm, federated learning (FL) faces serious challenges in dealing with massive clients with heterogeneous data distribution and computation and communication resources. Various client-variance-reduction schemes and client sampling strategies have been respectively introduced to improve the robustness of FL. Among others, primal-dual algorithms such as the alternating direction of method multipliers (ADMM) have been found being resilient to data distribution and outperform most of the primal-only FL algorithms. However, the reason behind remains a mystery still. In this paper, we firstly reveal the fact that the federated ADMM is essentially a client-variance-reduced algorithm. While this explains the inherent robustness of federated ADMM, the vanilla version of it lacks the ability to be adaptive to the degree of client heterogeneity. Besides, the global model at the server under client sampling is biased which slows down the practical convergence. To go beyond ADMM, we propose a novel primal-dual FL algorithm, termed FedVRA, that allows one to adaptively control the variance-reduction level and biasness of the global model. In addition, FedVRA unifies several representative FL algorithms in the sense that they are either special instances of FedVRA or are close to it. Extensions of FedVRA to semi/un-supervised learning are also presented. Experiments based on (semi-)supervised image classification tasks demonstrate superiority of FedVRA over the existing schemes in learning scenarios with massive heterogeneous clients and client sampling.
    Geometry-Complete Perceptron Networks for 3D Molecular Graphs. (arXiv:2211.02504v2 [cs.LG] UPDATED)
    The field of geometric deep learning has had a profound impact on the development of innovative and powerful graph neural network architectures. Disciplines such as computer vision and computational biology have benefited significantly from such methodological advances, which has led to breakthroughs in scientific domains such as protein structure prediction and design. In this work, we introduce GCPNet, a new geometry-complete, SE(3)-equivariant graph neural network designed for 3D molecular graph representation learning. We demonstrate the state-of-the-art utility and expressiveness of our method on six independent datasets designed for three distinct geometric tasks: protein-ligand binding affinity prediction, protein structure ranking, and Newtonian many-body systems modeling. Our results suggest that GCPNet is a powerful, general method for capturing complex geometric and physical interactions within 3D molecular graphs for downstream prediction tasks. The source code, data, and instructions to train new models or reproduce our results are freely available at https://github.com/BioinfoMachineLearning/GCPNet.
    Unbalanced Optimal Transport, from Theory to Numerics. (arXiv:2211.08775v2 [stat.ML] UPDATED)
    Optimal Transport (OT) has recently emerged as a central tool in data sciences to compare in a geometrically faithful way point clouds and more generally probability distributions. The wide adoption of OT into existing data analysis and machine learning pipelines is however plagued by several shortcomings. This includes its lack of robustness to outliers, its high computational costs, the need for a large number of samples in high dimension and the difficulty to handle data in distinct spaces. In this review, we detail several recently proposed approaches to mitigate these issues. We insist in particular on unbalanced OT, which compares arbitrary positive measures, not restricted to probability distributions (i.e. their total mass can vary). This generalization of OT makes it robust to outliers and missing data. The second workhorse of modern computational OT is entropic regularization, which leads to scalable algorithms while lowering the sample complexity in high dimension. The last point presented in this review is the Gromov-Wasserstein (GW) distance, which extends OT to cope with distributions belonging to different metric spaces. The main motivation for this review is to explain how unbalanced OT, entropic regularization and GW can work hand-in-hand to turn OT into efficient geometric loss functions for data sciences.
    Algorithmic progress in computer vision. (arXiv:2212.05153v3 [cs.CV] UPDATED)
    We investigate algorithmic progress in image classification on ImageNet, perhaps the most well-known test bed for computer vision. We estimate a model, informed by work on neural scaling laws, and infer a decomposition of progress into the scaling of compute, data, and algorithms. Using Shapley values to attribute performance improvements, we find that algorithmic improvements have been roughly as important as the scaling of compute for progress computer vision. Our estimates indicate that algorithmic innovations mostly take the form of compute-augmenting algorithmic advances (which enable researchers to get better performance from less compute), not data-augmenting algorithmic advances. We find that compute-augmenting algorithmic advances are made at a pace more than twice as fast as the rate usually associated with Moore's law. In particular, we estimate that compute-augmenting innovations halve compute requirements every nine months (95\% confidence interval: 4 to 25 months).
    Accelerated Riemannian Optimization: Handling Constraints with a Prox to Bound Geometric Penalties. (arXiv:2211.14645v2 [math.OC] UPDATED)
    We propose a globally-accelerated, first-order method for the optimization of smooth and (strongly or not) geodesically-convex functions in a wide class of Hadamard manifolds. We achieve the same convergence rates as Nesterov's accelerated gradient descent, up to a multiplicative geometric penalty and log factors. Crucially, we can enforce our method to stay within a compact set we define. Prior fully accelerated works \emph{resort to assuming} that the iterates of their algorithms stay in some pre-specified compact set, except for two previous methods of limited applicability. For our manifolds, this solves the open question in [KY22] about obtaining global general acceleration without iterates assumptively staying in the feasible set. In our solution, we design an accelerated Riemannian inexact proximal point algorithm, which is a result that was unknown even with exact access to the proximal operator, and is of independent interest. For smooth functions, we show we can implement the prox step inexactly with first-order methods in Riemannian balls of certain diameter that is enough for global accelerated optimization.
    The Benefits of Model-Based Generalization in Reinforcement Learning. (arXiv:2211.02222v2 [cs.LG] UPDATED)
    Model-Based Reinforcement Learning (RL) is widely believed to have the potential to improve sample efficiency by allowing an agent to synthesize large amounts of imagined experience. Experience Replay (ER) can be considered a simple kind of model, which has proved extremely effective at improving the stability and efficiency of deep RL. In principle, a learned parametric model could improve on ER by generalizing from real experience to augment the dataset with additional plausible experience. However, owing to the many design choices involved in empirically successful algorithms, it can be very hard to establish where the benefits are actually coming from. Here, we provide theoretical and empirical insight into when, and how, we can expect data generated by a learned model to be useful. First, we provide a general theorem motivating how learning a model as an intermediate step can narrow down the set of possible value functions more than learning a value function directly from data using the Bellman equation. Second, we provide an illustrative example showing empirically how a similar effect occurs in a more concrete setting with neural network function approximation. Finally, we provide extensive experiments showing the benefit of model-based learning for online RL in environments with combinatorial complexity, but factored structure that allows a learned model to generalize. In these experiments, we take care to control for other factors in order to isolate, insofar as possible, the benefit of using experience generated by a learned model relative to ER alone.
    Geometric Knowledge Distillation: Topology Compression for Graph Neural Networks. (arXiv:2210.13014v2 [cs.LG] UPDATED)
    We study a new paradigm of knowledge transfer that aims at encoding graph topological information into graph neural networks (GNNs) by distilling knowledge from a teacher GNN model trained on a complete graph to a student GNN model operating on a smaller or sparser graph. To this end, we revisit the connection between thermodynamics and the behavior of GNN, based on which we propose Neural Heat Kernel (NHK) to encapsulate the geometric property of the underlying manifold concerning the architecture of GNNs. A fundamental and principled solution is derived by aligning NHKs on teacher and student models, dubbed as Geometric Knowledge Distillation. We develop non- and parametric instantiations and demonstrate their efficacy in various experimental settings for knowledge distillation regarding different types of privileged topological information and teacher-student schemes.
    SimVP: Towards Simple yet Powerful Spatiotemporal Predictive Learning. (arXiv:2211.12509v2 [cs.LG] UPDATED)
    Recent years have witnessed remarkable advances in spatiotemporal predictive learning, incorporating auxiliary inputs, elaborate neural architectures, and sophisticated training strategies. Although impressive, the system complexity of mainstream methods is increasing as well, which may hinder the convenient applications. This paper proposes SimVP, a simple spatiotemporal predictive baseline model that is completely built upon convolutional networks without recurrent architectures and trained by common mean squared error loss in an end-to-end fashion. Without introducing any extra tricks and strategies, SimVP can achieve superior performance on various benchmark datasets. To further improve the performance, we derive variants with the gated spatiotemporal attention translator from SimVP that can achieve better performance. We demonstrate that SimVP has strong generalization and extensibility on real-world datasets through extensive experiments. The significant reduction in training cost makes it easier to scale to complex scenarios. We believe SimVP can serve as a solid baseline to benefit the spatiotemporal predictive learning community.
    torchode: A Parallel ODE Solver for PyTorch. (arXiv:2210.12375v2 [cs.LG] UPDATED)
    We introduce an ODE solver for the PyTorch ecosystem that can solve multiple ODEs in parallel independently from each other while achieving significant performance gains. Our implementation tracks each ODE's progress separately and is carefully optimized for GPUs and compatibility with PyTorch's JIT compiler. Its design lets researchers easily augment any aspect of the solver and collect and analyze internal solver statistics. In our experiments, our implementation is up to 4.3 times faster per step than other ODE solvers and it is robust against within-batch interactions that lead other solvers to take up to 4 times as many steps. Code available at https://github.com/martenlienen/torchode
    Black-box Coreset Variational Inference. (arXiv:2211.02377v2 [stat.ML] UPDATED)
    Recent advances in coreset methods have shown that a selection of representative datapoints can replace massive volumes of data for Bayesian inference, preserving the relevant statistical information and significantly accelerating subsequent downstream tasks. Existing variational coreset constructions rely on either selecting subsets of the observed datapoints, or jointly performing approximate inference and optimizing pseudodata in the observed space akin to inducing points methods in Gaussian Processes. So far, both approaches are limited by complexities in evaluating their objectives for general purpose models, and require generating samples from a typically intractable posterior over the coreset throughout inference and testing. In this work, we present a black-box variational inference framework for coresets that overcomes these constraints and enables principled application of variational coresets to intractable models, such as Bayesian neural networks. We apply our techniques to supervised learning problems, and compare them with existing approaches in the literature for data summarization and inference.
    Synthetic Dataset Generation for Privacy-Preserving Machine Learning. (arXiv:2210.03205v4 [cs.CR] UPDATED)
    Machine Learning (ML) has achieved enormous success in solving a variety of problems in computer vision, speech recognition, object detection, to name a few. The principal reason for this success is the availability of huge datasets for training deep neural networks (DNNs). However, datasets cannot be publicly released if they contain sensitive information such as medical records, and data privacy becomes a major concern. Encryption methods could be a possible solution, however their deployment on ML applications seriously impacts classification accuracy and results in substantial computational overhead. Alternatively, obfuscation techniques could be used, but maintaining a good trade-off between visual privacy and accuracy is challenging. In this paper, we propose a method to generate secure synthetic datasets from the original private datasets. Given a network with Batch Normalization (BN) layers pretrained on the original dataset, we first record the class-wise BN layer statistics. Next, we generate the synthetic dataset by optimizing random noise such that the synthetic data match the layer-wise statistical distribution of original images. We evaluate our method on image classification datasets (CIFAR10, ImageNet) and show that synthetic data can be used in place of the original CIFAR10/ImageNet data for training networks from scratch, producing comparable classification performance. Further, to analyze visual privacy provided by our method, we use Image Quality Metrics and show high degree of visual dissimilarity between the original and synthetic images. Moreover, we show that our proposed method preserves data-privacy under various privacy-leakage attacks including Gradient Matching Attack, Model Memorization Attack, and GAN-based Attack.
    Wild-Time: A Benchmark of in-the-Wild Distribution Shift over Time. (arXiv:2211.14238v2 [cs.LG] UPDATED)
    Distribution shift occurs when the test distribution differs from the training distribution, and it can considerably degrade performance of machine learning models deployed in the real world. Temporal shifts -- distribution shifts arising from the passage of time -- often occur gradually and have the additional structure of timestamp metadata. By leveraging timestamp metadata, models can potentially learn from trends in past distribution shifts and extrapolate into the future. While recent works have studied distribution shifts, temporal shifts remain underexplored. To address this gap, we curate Wild-Time, a benchmark of 5 datasets that reflect temporal distribution shifts arising in a variety of real-world applications, including patient prognosis and news classification. On these datasets, we systematically benchmark 13 prior approaches, including methods in domain generalization, continual learning, self-supervised learning, and ensemble learning. We use two evaluation strategies: evaluation with a fixed time split (Eval-Fix) and evaluation with a data stream (Eval-Stream). Eval-Fix, our primary evaluation strategy, aims to provide a simple evaluation protocol, while Eval-Stream is more realistic for certain real-world applications. Under both evaluation strategies, we observe an average performance drop of 20% from in-distribution to out-of-distribution data. Existing methods are unable to close this gap. Code is available at https://wild-time.github.io/.
    Local Bayesian optimization via maximizing probability of descent. (arXiv:2210.11662v2 [cs.LG] UPDATED)
    Local optimization presents a promising approach to expensive, high-dimensional black-box optimization by sidestepping the need to globally explore the search space. For objective functions whose gradient cannot be evaluated directly, Bayesian optimization offers one solution -- we construct a probabilistic model of the objective, design a policy to learn about the gradient at the current location, and use the resulting information to navigate the objective landscape. Previous work has realized this scheme by minimizing the variance in the estimate of the gradient, then moving in the direction of the expected gradient. In this paper, we re-examine and refine this approach. We demonstrate that, surprisingly, the expected value of the gradient is not always the direction maximizing the probability of descent, and in fact, these directions may be nearly orthogonal. This observation then inspires an elegant optimization scheme seeking to maximize the probability of descent while moving in the direction of most-probable descent. Experiments on both synthetic and real-world objectives show that our method outperforms previous realizations of this optimization scheme and is competitive against other, significantly more complicated baselines.
    Betting the system: Using lineups to predict football scores. (arXiv:2210.06327v3 [cs.LG] UPDATED)
    This paper aims to reduce randomness in football by analysing the role of lineups in final scores using machine learning prediction models we have developed. Football clubs invest millions of dollars on lineups and knowing how individual statistics translate to better outcomes can optimise investments. Moreover, sports betting is growing exponentially and being able to predict the future is profitable and desirable. We use machine learning models and historical player data from English Premier League (2020-2022) to predict scores and to understand how individual performance can improve the outcome of a match. We compared different prediction techniques to maximise the possibility of finding useful models. We created heuristic and machine learning models predicting football scores to compare different techniques. We used different sets of features and shown goalkeepers stats are more important than attackers stats to predict goals scored. We applied a broad evaluation process to assess the efficacy of the models in real world applications. We managed to predict correctly all relegated teams after forecast 100 consecutive matches. We show that Support Vector Regression outperformed other techniques predicting final scores and that lineups do not improve predictions. Finally, our model was profitable (42% return) when emulating a betting system using real world odds data.
    Towards Out-of-Distribution Sequential Event Prediction: A Causal Treatment. (arXiv:2210.13005v2 [cs.LG] UPDATED)
    The goal of sequential event prediction is to estimate the next event based on a sequence of historical events, with applications to sequential recommendation, user behavior analysis and clinical treatment. In practice, the next-event prediction models are trained with sequential data collected at one time and need to generalize to newly arrived sequences in remote future, which requires models to handle temporal distribution shift from training to testing. In this paper, we first take a data-generating perspective to reveal a negative result that existing approaches with maximum likelihood estimation would fail for distribution shift due to the latent context confounder, i.e., the common cause for the historical events and the next event. Then we devise a new learning objective based on backdoor adjustment and further harness variational inference to make it tractable for sequence learning problems. On top of that, we propose a framework with hierarchical branching structures for learning context-specific representations. Comprehensive experiments on diverse tasks (e.g., sequential recommendation) demonstrate the effectiveness, applicability and scalability of our method with various off-the-shelf models as backbones.
    An Exponentially Converging Particle Method for the Mixed Nash Equilibrium of Continuous Games. (arXiv:2211.01280v2 [math.OC] UPDATED)
    We consider the problem of computing mixed Nash equilibria of two-player zero-sum games with continuous sets of pure strategies and with first-order access to the payoff function. This problem arises for example in game-theory-inspired machine learning applications, such as distributionally-robust learning. In those applications, the strategy sets are high-dimensional and thus methods based on discretisation cannot tractably return high-accuracy solutions. In this paper, we introduce and analyze a particle-based method that enjoys guaranteed local convergence for this problem. This method consists in parametrizing the mixed strategies as atomic measures and applying proximal point updates to both the atoms' weights and positions. It can be interpreted as a time-implicit discretization of the "interacting" Wasserstein-Fisher-Rao gradient flow. We prove that, under non-degeneracy assumptions, this method converges at an exponential rate to the exact mixed Nash equilibrium from any initialization satisfying a natural notion of closeness to optimality. We illustrate our results with numerical experiments and discuss applications to max-margin and distributionally-robust classification using two-layer neural networks, where our method has a natural interpretation as a simultaneous training of the network's weights and of the adversarial distribution.
    Keypoint-GraspNet: Keypoint-based 6-DoF Grasp Generation from the Monocular RGB-D input. (arXiv:2209.08752v2 [cs.RO] UPDATED)
    Great success has been achieved in the 6-DoF grasp learning from the point cloud input, yet the computational cost due to the point set orderlessness remains a concern. Alternatively, we explore the grasp generation from the RGB-D input in this paper. The proposed solution, Keypoint-GraspNet, detects the projection of the gripper keypoints in the image space and then recover the SE(3) poses with a PnP algorithm. A synthetic dataset based on the primitive shape and the grasp family is constructed to examine our idea. Metric-based evaluation reveals that our method outperforms the baselines in terms of the grasp proposal accuracy, diversity, and the time cost. Finally, robot experiments show high success rate, demonstrating the potential of the idea in the real-world applications.
    Deep Counterfactual Estimation with Categorical Background Variables. (arXiv:2210.05811v4 [cs.LG] UPDATED)
    Referred to as the third rung of the causal inference ladder, counterfactual queries typically ask the "What if ?" question retrospectively. The standard approach to estimate counterfactuals resides in using a structural equation model that accurately reflects the underlying data generating process. However, such models are seldom available in practice and one usually wishes to infer them from observational data alone. Unfortunately, the correct structural equation model is in general not identifiable from the observed factual distribution. Nevertheless, in this work, we show that under the assumption that the main latent contributors to the treatment responses are categorical, the counterfactuals can be still reliably predicted. Building upon this assumption, we introduce CounterFactual Query Prediction (CFQP), a novel method to infer counterfactuals from continuous observations when the background variables are categorical. We show that our method significantly outperforms previously available deep-learning-based counterfactual methods, both theoretically and empirically on time series and image data. Our code is available at https://github.com/edebrouwer/cfqp.
    Automatic Generation of Product Concepts from Positive Examples, with an Application to Music Streaming. (arXiv:2210.01515v3 [cs.LG] UPDATED)
    Internet based businesses and products (e.g. e-commerce, music streaming) are becoming more and more sophisticated every day with a lot of focus on improving customer satisfaction. A core way they achieve this is by providing customers with an easy access to their products by structuring them in catalogues using navigation bars and providing recommendations. We refer to these catalogues as product concepts, e.g. product categories on e-commerce websites, public playlists on music streaming platforms. These product concepts typically contain products that are linked with each other through some common features (e.g. a playlist of songs by the same artist). How they are defined in the backend of the system can be different for different products. In this work, we represent product concepts using database queries and tackle two learning problems. First, given sets of products that all belong to the same unknown product concept, we learn a database query that is a representation of this product concept. Second, we learn product concepts and their corresponding queries when the given sets of products are associated with multiple product concepts. To achieve these goals, we propose two approaches that combine the concepts of PU learning with Decision Trees and Clustering. Our experiments demonstrate, via a simulated setup for a music streaming service, that our approach is effective in solving these problems.
    Improved Bounds on Neural Complexity for Representing Piecewise Linear Functions. (arXiv:2210.07236v3 [cs.LG] UPDATED)
    A deep neural network using rectified linear units represents a continuous piecewise linear (CPWL) function and vice versa. Recent results in the literature estimated that the number of neurons needed to exactly represent any CPWL function grows exponentially with the number of pieces or exponentially in terms of the factorial of the number of distinct linear components. Moreover, such growth is amplified linearly with the input dimension. These existing results seem to indicate that the cost of representing a CPWL function is expensive. In this paper, we propose much tighter bounds and establish a polynomial time algorithm to find a network satisfying these bounds for any given CPWL function. We prove that the number of hidden neurons required to exactly represent any CPWL function is at most a quadratic function of the number of pieces. In contrast to all previous results, this upper bound is invariant to the input dimension. Besides the number of pieces, we also study the number of distinct linear components in CPWL functions. When such a number is also given, we prove that the quadratic complexity turns into bilinear, which implies a lower neural complexity because the number of distinct linear components is always not greater than the minimum number of pieces in a CPWL function. When the number of pieces is unknown, we prove that, in terms of the number of distinct linear components, the neural complexities of any CPWL function are at most polynomial growth for low-dimensional inputs and factorial growth for the worst-case scenario, which are significantly better than existing results in the literature.
    Semi-Supervised Junction Tree Variational Autoencoder for Molecular Property Prediction. (arXiv:2208.05119v5 [cs.LG] UPDATED)
    Molecular Representation Learning is essential to solving many drug discovery and computational chemistry problems. It is a challenging problem due to the complex structure of molecules and the vast chemical space. Graph representations of molecules are more expressive than traditional representations, such as molecular fingerprints. Therefore, they can improve the performance of machine learning models. We propose SeMole, a method that augments the Junction Tree Variational Autoencoders, a state-of-the-art generative model for molecular graphs, with semi-supervised learning. SeMole aims to improve the accuracy of molecular property prediction when having limited labeled data by exploiting unlabeled data. We enforce that the model generates molecular graphs conditioned on target properties by incorporating the property into the latent representation. We propose an additional pre-training phase to improve the training process for our semi-supervised generative model. We perform an experimental evaluation on the ZINC dataset using three different molecular properties and demonstrate the benefits of semi-supervision.
    Neural Observer with Lyapunov Stability Guarantee for Uncertain Nonlinear Systems. (arXiv:2208.13006v2 [math.OC] UPDATED)
    In this paper, we propose a novel nonlinear observer based on neural networks, called neural observer, for observation tasks of linear time-invariant (LTI) systems and uncertain nonlinear systems. In particular, the neural observer designed for uncertain systems is inspired by the active disturbance rejection control, which can measure the uncertainty in real-time. The stability analysis (e.g., exponential convergence rate) of LTI and uncertain nonlinear systems (involving neural observers) are presented and guaranteed, where it is shown that the observation problems can be solved only using the linear matrix inequalities (LMIs). Also, it is revealed that the observability and controllability of the system matrices are required to demonstrate the existence of solutions of LMIs. Finally, the effectiveness of neural observers is verified on three simulation cases, including the X-29A aircraft model, the nonlinear pendulum, and the four-wheel steering vehicle.
    Brain Tumor Segmentation using Enhanced U-Net Model with Empirical Analysis. (arXiv:2210.13336v2 [eess.IV] UPDATED)
    Cancer of the brain is deadly and requires careful surgical segmentation. The brain tumors were segmented using U-Net using a Convolutional Neural Network (CNN). When looking for overlaps of necrotic, edematous, growing, and healthy tissue, it might be hard to get relevant information from the images. The 2D U-Net network was improved and trained with the BraTS datasets to find these four areas. U-Net can set up many encoder and decoder routes that can be used to get information from images that can be used in different ways. To reduce computational time, we use image segmentation to exclude insignificant background details. Experiments on the BraTS datasets show that our proposed model for segmenting brain tumors from MRI (MRI) works well. In this study, we demonstrate that the BraTS datasets for 2017, 2018, 2019, and 2020 do not significantly differ from the BraTS 2019 dataset's attained dice scores of 0.8717 (necrotic), 0.9506 (edema), and 0.9427 (enhancing).
    Learning Probabilistic Models from Generator Latent Spaces with Hat EBM. (arXiv:2210.16486v2 [cs.CV] UPDATED)
    This work proposes a method for using any generator network as the foundation of an Energy-Based Model (EBM). Our formulation posits that observed images are the sum of unobserved latent variables passed through the generator network and a residual random variable that spans the gap between the generator output and the image manifold. One can then define an EBM that includes the generator as part of its forward pass, which we call the Hat EBM. The model can be trained without inferring the latent variables of the observed data or calculating the generator Jacobian determinant. This enables explicit probabilistic modeling of the output distribution of any type of generator network. Experiments show strong performance of the proposed method on (1) unconditional ImageNet synthesis at 128x128 resolution, (2) refining the output of existing generators, and (3) learning EBMs that incorporate non-probabilistic generators. Code and pretrained models to reproduce our results are available at https://github.com/point0bar1/hat-ebm.
    Competition, Alignment, and Equilibria in Digital Marketplaces. (arXiv:2208.14423v2 [cs.GT] UPDATED)
    Competition between traditional platforms is known to improve user utility by aligning the platform's actions with user preferences. But to what extent is alignment exhibited in data-driven marketplaces? To study this question from a theoretical perspective, we introduce a duopoly market where platform actions are bandit algorithms and the two platforms compete for user participation. A salient feature of this market is that the quality of recommendations depends on both the bandit algorithm and the amount of data provided by interactions from users. This interdependency between the algorithm performance and the actions of users complicates the structure of market equilibria and their quality in terms of user utility. Our main finding is that competition in this market does not perfectly align market outcomes with user utility. Interestingly, market outcomes exhibit misalignment not only when the platforms have separate data repositories, but also when the platforms have a shared data repository. Nonetheless, the data sharing assumptions impact what mechanism drives misalignment and also affect the specific form of misalignment (e.g. the quality of the best-case and worst-case market outcomes). More broadly, our work illustrates that competition in digital marketplaces has subtle consequences for user utility that merit further investigation.
    Pattern Attention Transformer with Doughnut Kernel. (arXiv:2211.16961v2 [cs.CV] UPDATED)
    We present in this paper a new architecture, the Pattern Attention Transformer (PAT), that is composed of the new doughnut kernel. Compared with tokens in the NLP field, Transformer in computer vision has the problem of handling the high resolution of pixels in images. In ViT, an image is cut into square-shaped patches. As the follow-up of ViT, Swin Transformer proposes an additional step of shifting to decrease the existence of fixed boundaries, which also incurs 'two connected Swin Transformer blocks' as the minimum unit of the model. Inheriting the patch/window idea, our doughnut kernel enhances the design of patches further. It replaces the line-cut boundaries with two types of areas: sensor and updating, which is based on the comprehension of self-attention (named QKVA grid). The doughnut kernel also brings a new topic about the shape of kernels beyond square. To verify its performance on image classification, PAT is designed with Transformer blocks of regular octagon shape doughnut kernels. Its architecture is lighter: the minimum pattern attention layer is only one for each stage. Under similar complexity of computation, its performances on ImageNet 1K reach higher throughput (+10\%) and surpass Swin Transformer (+0.1 acc1).
    Breast Cancer Classification using Deep Learned Features Boosted with Handcrafted Features. (arXiv:2206.12815v2 [eess.IV] UPDATED)
    Breast cancer is one of the leading causes of death among women across the globe. It is difficult to treat if detected at advanced stages, however, early detection can significantly increase chances of survival and improves lives of millions of women. Given the widespread prevalence of breast cancer, it is of utmost importance for the research community to come up with the framework for early detection, classification and diagnosis. Artificial intelligence research community in coordination with medical practitioners are developing such frameworks to automate the task of detection. With the surge in research activities coupled with availability of large datasets and enhanced computational powers, it expected that AI framework results will help even more clinicians in making correct predictions. In this article, a novel framework for classification of breast cancer using mammograms is proposed. The proposed framework combines robust features extracted from novel Convolutional Neural Network (CNN) features with handcrafted features including HOG (Histogram of Oriented Gradients) and LBP (Local Binary Pattern). The obtained results on CBIS-DDSM dataset exceed state of the art.
    Quantized Training of Gradient Boosting Decision Trees. (arXiv:2207.09682v2 [cs.LG] UPDATED)
    Recent years have witnessed significant success in Gradient Boosting Decision Trees (GBDT) for a wide range of machine learning applications. Generally, a consensus about GBDT's training algorithms is gradients and statistics are computed based on high-precision floating points. In this paper, we investigate an essentially important question which has been largely ignored by the previous literature: how many bits are needed for representing gradients in training GBDT? To solve this mystery, we propose to quantize all the high-precision gradients in a very simple yet effective way in the GBDT's training algorithm. Surprisingly, both our theoretical analysis and empirical studies show that the necessary precisions of gradients without hurting any performance can be quite low, e.g., 2 or 3 bits. With low-precision gradients, most arithmetic operations in GBDT training can be replaced by integer operations of 8, 16, or 32 bits. Promisingly, these findings may pave the way for much more efficient training of GBDT from several aspects: (1) speeding up the computation of gradient statistics in histograms; (2) compressing the communication cost of high-precision statistical information during distributed training; (3) the inspiration of utilization and development of hardware architectures which well support low-precision computation for GBDT training. Benchmarked on CPUs, GPUs, and distributed clusters, we observe up to 2$\times$ speedup of our simple quantization strategy compared with SOTA GBDT systems on extensive datasets, demonstrating the effectiveness and potential of the low-precision training of GBDT. The code will be released to the official repository of LightGBM.
    Progressive Domain Adaptation with Contrastive Learning for Object Detection in the Satellite Imagery. (arXiv:2209.02564v2 [cs.CV] UPDATED)
    Images in aerial datasets are very large in resolution, and each frame contains many dense and small objects. State-of-the-art detection methods fail to capture small objects, local features, and region proposals for densely overlapped objects in aerial imagery due to the high variation of object sizes in satellite imagery with respect to the image size and high variation of content. Aerial imagery content varies greatly within the dataset due to the large change in lighting conditions, and the type of ground imagery captures from high altitudes. The variation is even higher between different datasets as object sizes, class distributions, image acquisition, and weather conditions can vary even more drastically. Thus, Domain Adaptation (DA) has been introduced as a band-aid to alleviate the degradation of object identification in previously unseen datasets. In this paper, we propose a small object detection pipeline that improves the feature extraction process by spatial pyramid pooling, cross-stage partial networks, heat-map-based region proposal network, and objects localization and identification through a novel image difficulty score that adapts the overall focal loss measure based on the image difficulty. Next, we propose novel contrastive learning with progressive domain adaptation to produce domain-invariant features across aerial datasets using local and global features. Effective analysis and illustration of different performance metrics and challenges show that our proposed method is comparable to the current State-of-Art models and creates a first-ever Domain Adaptation benchmark for the object detection task in highly imbalanced satellite datasets with large domain gaps and dominant small objects.
    Sequence Model Imitation Learning with Unobserved Contexts. (arXiv:2208.02225v3 [cs.LG] UPDATED)
    We consider imitation learning problems where the learner's ability to mimic the expert increases throughout the course of an episode as more information is revealed. One example of this is when the expert has access to privileged information: while the learner might not be able to accurately reproduce expert behavior early on in an episode, by considering the entire history of states and actions, they might be able to eventually identify the hidden context and act as the expert would. We prove that on-policy imitation learning algorithms (with or without access to a queryable expert) are better equipped to handle these sorts of asymptotically realizable problems than off-policy methods. This is because on-policy algorithms provably learn to recover from their initially suboptimal actions, while off-policy methods treat their suboptimal past actions as though they came from the expert. This often manifests as a latching behavior: a naive repetition of past actions. We conduct experiments in a toy bandit domain that show that there exist sharp phase transitions of whether off-policy approaches are able to match expert performance asymptotically, in contrast to the uniformly good performance of on-policy approaches. We demonstrate that on several continuous control tasks, on-policy approaches are able to use history to identify the context while off-policy approaches actually perform worse when given access to history.
    ViNL: Visual Navigation and Locomotion Over Obstacles. (arXiv:2210.14791v2 [cs.RO] UPDATED)
    We present Visual Navigation and Locomotion over obstacles (ViNL), which enables a quadrupedal robot to navigate unseen apartments while stepping over small obstacles that lie in its path (e.g., shoes, toys, cables), similar to how humans and pets lift their feet over objects as they walk. ViNL consists of: (1) a visual navigation policy that outputs linear and angular velocity commands that guides the robot to a goal coordinate in unfamiliar indoor environments; and (2) a visual locomotion policy that controls the robot's joints to avoid stepping on obstacles while following provided velocity commands. Both the policies are entirely "model-free", i.e. sensors-to-actions neural networks trained end-to-end. The two are trained independently in two entirely different simulators and then seamlessly co-deployed by feeding the velocity commands from the navigator to the locomotor, entirely "zero-shot" (without any co-training). While prior works have developed learning methods for visual navigation or visual locomotion, to the best of our knowledge, this is the first fully learned approach that leverages vision to accomplish both (1) intelligent navigation in new environments, and (2) intelligent visual locomotion that aims to traverse cluttered environments without disrupting obstacles. On the task of navigation to distant goals in unknown environments, ViNL using just egocentric vision significantly outperforms prior work on robust locomotion using privileged terrain maps (+32.8% success and -4.42 collisions per meter). Additionally, we ablate our locomotion policy to show that each aspect of our approach helps reduce obstacle collisions. Videos and code at this http URL
    Efficient Signed Graph Sampling via Balancing & Gershgorin Disc Perfect Alignment. (arXiv:2208.08726v2 [eess.SP] UPDATED)
    A basic premise in graph signal processing (GSP) is that a graph encoding pairwise (anti-)correlations of the targeted signal as edge weights is exploited for graph filtering. However, existing fast graph sampling schemes are designed and tested only for positive graphs describing positive correlations. In this paper, we show that for datasets with strong inherent anti-correlations, a suitable graph contains both positive and negative edge weights. In response, we propose a linear-time signed graph sampling method centered on the concept of balanced signed graphs. Specifically, given an empirical covariance data matrix $\bar{\bf{C}}$, we first learn a sparse inverse matrix (graph Laplacian) $\mathcal{L}$ corresponding to a signed graph $\mathcal{G}$. We define the eigenvectors of Laplacian $\mathcal{L}_B$ for a balanced signed graph $\mathcal{G}_B$ -- approximating $\mathcal{G}$ via edge weight augmentation -- as graph frequency components. Next, we choose samples to minimize the low-pass filter reconstruction error in two steps. We first align all Gershgorin disc left-ends of Laplacian $\mathcal{L}_B$ at smallest eigenvalue $\lambda_{\min}(\mathcal{L}_B)$ via similarity transform $\mathcal{L}_p = \S \mathcal{L}_B \S^{-1}$, leveraging a recent linear algebra theorem called Gershgorin disc perfect alignment (GDPA). We then perform sampling on $\mathcal{L}_p$ using a previous fast Gershgorin disc alignment sampling (GDAS) scheme. Experimental results show that our signed graph sampling method outperformed existing fast sampling schemes noticeably on various datasets.
    Linear Convergence of ISTA and FISTA. (arXiv:2212.06319v2 [math.OC] UPDATED)
    In this paper, we revisit the class of iterative shrinkage-thresholding algorithms (ISTA) for solving the linear inverse problem with sparse representation, which arises in signal and image processing. It is shown in the numerical experiment to deblur an image that the convergence behavior in the logarithmic-scale ordinate tends to be linear instead of logarithmic, approximating to be flat. Making meticulous observations, we find that the previous assumption for the smooth part to be convex weakens the least-square model. Specifically, assuming the smooth part to be strongly convex is more reasonable for the least-square model, even though the image matrix is probably ill-conditioned. Furthermore, we improve the pivotal inequality tighter for composite optimization with the smooth part to be strongly convex instead of general convex, which is first found in [Li et al., 2022]. Based on this pivotal inequality, we generalize the linear convergence to composite optimization in both the objective value and the squared proximal subgradient norm. Meanwhile, we set a simple ill-conditioned matrix which is easy to compute the singular values instead of the original blur matrix. The new numerical experiment shows the proximal generalization of Nesterov's accelerated gradient descent (NAG) for the strongly convex function has a faster linear convergence rate than ISTA. Based on the tighter pivotal inequality, we also generalize the faster linear convergence rate to composite optimization, in both the objective value and the squared proximal subgradient norm, by taking advantage of the well-constructed Lyapunov function with a slight modification and the phase-space representation based on the high-resolution differential equation framework from the implicit-velocity scheme.
    Training Scale-Invariant Neural Networks on the Sphere Can Happen in Three Regimes. (arXiv:2209.03695v3 [cs.LG] UPDATED)
    A fundamental property of deep learning normalization techniques, such as batch normalization, is making the pre-normalization parameters scale invariant. The intrinsic domain of such parameters is the unit sphere, and therefore their gradient optimization dynamics can be represented via spherical optimization with varying effective learning rate (ELR), which was studied previously. However, the varying ELR may obscure certain characteristics of the intrinsic loss landscape structure. In this work, we investigate the properties of training scale-invariant neural networks directly on the sphere using a fixed ELR. We discover three regimes of such training depending on the ELR value: convergence, chaotic equilibrium, and divergence. We study these regimes in detail both on a theoretical examination of a toy example and on a thorough empirical analysis of real scale-invariant deep learning models. Each regime has unique features and reflects specific properties of the intrinsic loss landscape, some of which have strong parallels with previous research on both regular and scale-invariant neural networks training. Finally, we demonstrate how the discovered regimes are reflected in conventional training of normalized networks and how they can be leveraged to achieve better optima.
    Self-Supervised Learning for Anomalous Channel Detection in EEG Graphs: Application to Seizure Analysis. (arXiv:2208.07448v4 [cs.LG] UPDATED)
    Electroencephalogram (EEG) signals are effective tools towards seizure analysis where one of the most important challenges is accurate detection of seizure events and brain regions in which seizure happens or initiates. However, all existing machine learning-based algorithms for seizure analysis require access to the labeled seizure data while acquiring labeled data is very labor intensive, expensive, as well as clinicians dependent given the subjective nature of the visual qualitative interpretation of EEG signals. In this paper, we propose to detect seizure channels and clips in a self-supervised manner where no access to the seizure data is needed. The proposed method considers local structural and contextual information embedded in EEG graphs by employing positive and negative sub-graphs. We train our method through minimizing contrastive and generative losses. The employ of local EEG sub-graphs makes the algorithm an appropriate choice when accessing to the all EEG channels is impossible due to complications such as skull fractures. We conduct an extensive set of experiments on the largest seizure dataset and demonstrate that our proposed framework outperforms the state-of-the-art methods in the EEG-based seizure study. The proposed method is the only study that requires no access to the seizure data in its training phase, yet establishes a new state-of-the-art to the field, and outperforms all related supervised methods.
    Dilated Neighborhood Attention Transformer. (arXiv:2209.15001v3 [cs.CV] UPDATED)
    Transformers are quickly becoming one of the most heavily applied deep learning architectures across modalities, domains, and tasks. In vision, on top of ongoing efforts into plain transformers, hierarchical transformers have also gained significant attention, thanks to their performance and easy integration into existing frameworks. These models typically employ localized attention mechanisms, such as the sliding-window Neighborhood Attention (NA) or Swin Transformer's Shifted Window Self Attention. While effective at reducing self attention's quadratic complexity, local attention weakens two of the most desirable properties of self attention: long range inter-dependency modeling, and global receptive field. In this paper, we introduce Dilated Neighborhood Attention (DiNA), a natural, flexible and efficient extension to NA that can capture more global context and expand receptive fields exponentially at no additional cost. NA's local attention and DiNA's sparse global attention complement each other, and therefore we introduce Dilated Neighborhood Attention Transformer (DiNAT), a new hierarchical vision transformer built upon both. DiNAT variants enjoy significant improvements over strong baselines such as NAT, Swin, and ConvNeXt. Our large model is faster and ahead of its Swin counterpart by 1.6% box AP in COCO object detection, 1.4% mask AP in COCO instance segmentation, and 1.4% mIoU in ADE20K semantic segmentation. Paired with new frameworks, our large variant is the new state of the art panoptic segmentation model on COCO (58.5 PQ) and ADE20K (49.4 PQ), and instance segmentation model on Cityscapes (45.1 AP) and ADE20K (35.4 AP) (no extra data). It also matches the state of the art specialized semantic segmentation models on ADE20K (58.1 mIoU), and ranks second on Cityscapes (84.5 mIoU) (no extra data).
    Tightening Discretization-based MILP Models for the Pooling Problem using Upper Bounds on Bilinear Terms. (arXiv:2207.03699v2 [math.OC] UPDATED)
    Discretization-based methods have been proposed for solving nonconvex optimization problems with bilinear terms such as the pooling problem. These methods convert the original nonconvex optimization problems into mixed-integer linear programs (MILPs). In this paper we study tightening methods for these MILP models for the pooling problem, and derive valid constraints using upper bounds on bilinear terms. Computational results demonstrate the effectiveness of our methods in terms of reducing solution time.
    Learning with Muscles: Benefits for Data-Efficiency and Robustness in Anthropomorphic Tasks. (arXiv:2207.03952v2 [cs.RO] UPDATED)
    Humans are able to outperform robots in terms of robustness, versatility, and learning of new tasks in a wide variety of movements. We hypothesize that highly nonlinear muscle dynamics play a large role in providing inherent stability, which is favorable to learning. While recent advances have been made in applying modern learning techniques to muscle-actuated systems both in simulation as well as in robotics, so far, no detailed analysis has been performed to show the benefits of muscles when learning from scratch. Our study closes this gap and showcases the potential of muscle actuators for core robotics challenges in terms of data-efficiency, hyperparameter sensitivity, and robustness.
    Adapting to Online Label Shift with Provable Guarantees. (arXiv:2207.02121v3 [cs.LG] UPDATED)
    The standard supervised learning paradigm works effectively when training data shares the same distribution as the upcoming testing samples. However, this stationary assumption is often violated in real-world applications, especially when testing data appear in an online fashion. In this paper, we formulate and investigate the problem of \emph{online label shift} (OLaS): the learner trains an initial model from the labeled offline data and then deploys it to an unlabeled online environment where the underlying label distribution changes over time but the label-conditional density does not. The non-stationarity nature and the lack of supervision make the problem challenging to be tackled. To address the difficulty, we construct a new unbiased risk estimator that utilizes the unlabeled data, which exhibits many benign properties albeit with potential non-convexity. Building upon that, we propose novel online ensemble algorithms to deal with the non-stationarity of the environments. Our approach enjoys optimal \emph{dynamic regret}, indicating that the performance is competitive with a clairvoyant who knows the online environments in hindsight and then chooses the best decision for each round. The obtained dynamic regret bound scales with the intensity and pattern of label distribution shift, hence exhibiting the adaptivity in the OLaS problem. Extensive experiments are conducted to validate the effectiveness and support our theoretical findings.
    Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit. (arXiv:2207.08799v3 [cs.LG] UPDATED)
    There is mounting evidence of emergent phenomena in the capabilities of deep learning methods as we scale up datasets, model sizes, and training times. While there are some accounts of how these resources modulate statistical capacity, far less is known about their effect on the computational problem of model training. This work conducts such an exploration through the lens of learning a $k$-sparse parity of $n$ bits, a canonical discrete search problem which is statistically easy but computationally hard. Empirically, we find that a variety of neural networks successfully learn sparse parities, with discontinuous phase transitions in the training curves. On small instances, learning abruptly occurs at approximately $n^{O(k)}$ iterations; this nearly matches SQ lower bounds, despite the apparent lack of a sparse prior. Our theoretical analysis shows that these observations are not explained by a Langevin-like mechanism, whereby SGD "stumbles in the dark" until it finds the hidden set of features (a natural algorithm which also runs in $n^{O(k)}$ time). Instead, we show that SGD gradually amplifies the sparse solution via a Fourier gap in the population gradient, making continual progress that is invisible to loss and error metrics.
    OpenXAI: Towards a Transparent Evaluation of Model Explanations. (arXiv:2206.11104v3 [cs.LG] UPDATED)
    While several types of post hoc explanation methods (e.g., feature attribution methods) have been proposed in recent literature, there is little to no work on systematically benchmarking these methods in an efficient and transparent manner. Here, we introduce OpenXAI, a comprehensive and extensible open source framework for evaluating and benchmarking post hoc explanation methods. OpenXAI comprises of the following key components: (i) a flexible synthetic data generator and a collection of diverse real-world datasets, pre-trained models, and state-of-the-art feature attribution methods, (ii) open-source implementations of twenty-two quantitative metrics for evaluating faithfulness, stability (robustness), and fairness of explanation methods, and (iii) the first ever public XAI leaderboards to benchmark explanations. OpenXAI is easily extensible, as users can readily evaluate custom explanation methods and incorporate them into our leaderboards. Overall, OpenXAI provides an automated end-to-end pipeline that not only simplifies and standardizes the evaluation of post hoc explanation methods, but also promotes transparency and reproducibility in benchmarking these methods. OpenXAI datasets and data loaders, implementations of state-of-the-art explanation methods and evaluation metrics, as well as leaderboards are publicly available at https://open-xai.github.io/.
    Primal Dual Alternating Proximal Gradient Algorithms for Nonsmooth Nonconvex Minimax Problems with Coupled Linear Constraints. (arXiv:2212.04672v2 [math.OC] UPDATED)
    Nonconvex minimax problems have attracted wide attention in machine learning, signal processing and many other fields in recent years. In this paper, we propose a primal dual alternating proximal gradient (PDAPG) algorithm and a primal dual proximal gradient (PDPG-L) algorithm for solving nonsmooth nonconvex-(strongly) concave and nonconvex-linear minimax problems with coupled linear constraints, respectively. The iteration complexity of the two algorithms are proved to be $\mathcal{O}\left( \varepsilon ^{-2} \right)$ (resp. $\mathcal{O}\left( \varepsilon ^{-4} \right)$) under nonconvex-strongly concave (resp. nonconvex-concave) setting and $\mathcal{O}\left( \varepsilon ^{-3} \right)$ under nonconvex-linear setting to reach an $\varepsilon$-stationary point, respectively. To our knowledge, they are the first two algorithms with iteration complexity guarantee for solving the nonconvex minimax problems with coupled linear constraints.
    Learning Deep Input-Output Stable Dynamics. (arXiv:2206.13093v3 [math.DS] UPDATED)
    Learning stable dynamics from observed time-series data is an essential problem in robotics, physical modeling, and systems biology. Many of these dynamics are represented as an inputs-output system to communicate with the external environment. In this study, we focus on input-output stable systems, exhibiting robustness against unexpected stimuli and noise. We propose a method to learn nonlinear systems guaranteeing the input-output stability. Our proposed method utilizes the differentiable projection onto the space satisfying the Hamilton-Jacobi inequality to realize the input-output stability. The problem of finding this projection can be formulated as a quadratic constraint quadratic programming problem, and we derive the particular solution analytically. Also, we apply our method to a toy bistable model and the task of training a benchmark generated from a glucose-insulin simulator. The results show that the nonlinear system with neural networks by our method achieves the input-output stability, unlike naive neural networks. Our code is available at https://github.com/clinfo/DeepIOStability.
    Evaluating Explainability for Graph Neural Networks. (arXiv:2208.09339v2 [cs.LG] UPDATED)
    As post hoc explanations are increasingly used to understand the behavior of graph neural networks (GNNs), it becomes crucial to evaluate the quality and reliability of GNN explanations. However, assessing the quality of GNN explanations is challenging as existing graph datasets have no or unreliable ground-truth explanations for a given task. Here, we introduce a synthetic graph data generator, ShapeGGen, which can generate a variety of benchmark datasets (e.g., varying graph sizes, degree distributions, homophilic vs. heterophilic graphs) accompanied by ground-truth explanations. Further, the flexibility to generate diverse synthetic datasets and corresponding ground-truth explanations allows us to mimic the data generated by various real-world applications. We include ShapeGGen and several real-world graph datasets into an open-source graph explainability library, GraphXAI. In addition to synthetic and real-world graph datasets with ground-truth explanations, GraphXAI provides data loaders, data processing functions, visualizers, GNN model implementations, and evaluation metrics to benchmark the performance of GNN explainability methods.
    On the effectiveness of persistent homology. (arXiv:2206.10551v3 [math.AT] UPDATED)
    Persistent homology (PH) is one of the most popular methods in Topological Data Analysis. Even though PH has been used in many different types of applications, the reasons behind its success remain elusive; in particular, it is not known for which classes of problems it is most effective, or to what extent it can detect geometric or topological features. The goal of this work is to identify some types of problems where PH performs well or even better than other methods in data analysis. We consider three fundamental shape analysis tasks: the detection of the number of holes, curvature and convexity from 2D and 3D point clouds sampled from shapes. Experiments demonstrate that PH is successful in these tasks, outperforming several baselines, including PointNet, an architecture inspired precisely by the properties of point clouds. In addition, we observe that PH remains effective for limited computational resources and limited training data, as well as out-of-distribution test data, including various data transformations and noise. For convexity detection, we provide a theoretical guarantee that PH is effective for this task in $\mathbb{R}^d$, and demonstrate the detection of a convexity measure on the FLAVIA data set of plant leaf images. Due to the crucial role of shape classification in understanding mathematical and physical structures and objects, and in many applications, the findings of this work will provide some knowledge about the types of problems that are appropriate for PH, so that it can - to borrow the words from Wigner 1960 - ``remain valid in future research, and extend, to our pleasure", but to our lesser bafflement, to a variety of applications.
    Continual Prune-and-Select: Class-incremental learning with specialized subnetworks. (arXiv:2208.04952v2 [cs.LG] UPDATED)
    The human brain is capable of learning tasks sequentially mostly without forgetting. However, deep neural networks (DNNs) suffer from catastrophic forgetting when learning one task after another. We address this challenge considering a class-incremental learning scenario where the DNN sees test data without knowing the task from which this data originates. During training, Continual-Prune-and-Select (CP&S) finds a subnetwork within the DNN that is responsible for solving a given task. Then, during inference, CP&S selects the correct subnetwork to make predictions for that task. A new task is learned by training available neuronal connections of the DNN (previously untrained) to create a new subnetwork by pruning, which can include previously trained connections belonging to other subnetwork(s) because it does not update shared connections. This enables to eliminate catastrophic forgetting by creating specialized regions in the DNN that do not conflict with each other while still allowing knowledge transfer across them. The CP&S strategy is implemented with different subnetwork selection strategies, revealing superior performance to state-of-the-art continual learning methods tested on various datasets (CIFAR-100, CUB-200-2011, ImageNet-100 and ImageNet-1000). In particular, CP&S is capable of sequentially learning 10 tasks from ImageNet-1000 keeping an accuracy around 94% with negligible forgetting, a first-of-its-kind result in class-incremental learning. To the best of the authors' knowledge, this represents an improvement in accuracy above 10% when compared to the best alternative method.
    AutoML Two-Sample Test. (arXiv:2206.08843v3 [cs.LG] UPDATED)
    Two-sample tests are important in statistics and machine learning, both as tools for scientific discovery as well as to detect distribution shifts. This led to the development of many sophisticated test procedures going beyond the standard supervised learning frameworks, whose usage can require specialized knowledge about two-sample testing. We use a simple test that takes the mean discrepancy of a witness function as the test statistic and prove that minimizing a squared loss leads to a witness with optimal testing power. This allows us to leverage recent advancements in AutoML. Without any user input about the problems at hand, and using the same method for all our experiments, our AutoML two-sample test achieves competitive performance on a diverse distribution shift benchmark as well as on challenging two-sample testing problems. We provide an implementation of the AutoML two-sample test in the Python package autotst.
    Towards Interpreting Vulnerability of Multi-Instance Learning via Customized and Universal Adversarial Perturbations. (arXiv:2211.17071v2 [cs.CV] UPDATED)
    Multiple-Instance Learning (MIL) is a recent machine learning paradigm which is immensely useful in various real-life applications, like image analysis, video anomaly detection, text classification, etc. It is well known that most of the existing machine learning classifiers are highly vulnerable to adversarial perturbations. Since MIL is a weakly supervised learning, where information is available for a set of instances, called bag and not for every instances, adversarial perturbations can be fatal. In this paper, we have proposed two adversarial perturbation methods to analyze the effect of adversarial perturbations to interpret the vulnerability of MIL methods. Out of the two algorithms, one can be customized for every bag, and the other is a universal one, which can affect all bags in a given data set and thus has some generalizability. Through simulations, we have also shown the effectiveness of the proposed algorithms to fool the state-of-the-art (SOTA) MIL methods. Finally, we have discussed through experiments, about taking care of these kind of adversarial perturbations through a simple strategy. Source codes are available at https://github.com/InkiInki/MI-UAP.
    NAGphormer: A Tokenized Graph Transformer for Node Classification in Large Graphs. (arXiv:2206.04910v3 [cs.LG] UPDATED)
    The graph Transformer emerges as a new architecture and has shown superior performance on various graph mining tasks. In this work, we observe that existing graph Transformers treat nodes as independent tokens and construct a single long sequence composed of all node tokens so as to train the Transformer model, causing it hard to scale to large graphs due to the quadratical complexity on the number of nodes for the self-attention computation. To this end, we propose a Neighborhood Aggregation Graph Transformer (NAGphormer) that treats each node as a sequence containing a series of tokens constructed by our proposed Hop2Token module. For each node, Hop2Token aggregates the neighborhood features from different hops into different representations and thereby produces a sequence of token vectors as one input. In this way, NAGphormer could be trained in a mini-batch manner and thus could scale to large graphs. Moreover, we mathematically show that as compared to a category of advanced Graph Neural Networks (GNNs), the decoupled Graph Convolutional Network, NAGphormer could learn more informative node representations from the multi-hop neighborhoods. Extensive experiments on benchmark datasets from small to large are conducted to demonstrate that NAGphormer consistently outperforms existing graph Transformers and mainstream GNNs.
    Improved Algorithms for Neural Active Learning. (arXiv:2210.00423v3 [cs.LG] UPDATED)
    We improve the theoretical and empirical performance of neural-network(NN)-based active learning algorithms for the non-parametric streaming setting. In particular, we introduce two regret metrics by minimizing the population loss that are more suitable in active learning than the one used in state-of-the-art (SOTA) related work. Then, the proposed algorithm leverages the powerful representation of NNs for both exploitation and exploration, has the query decision-maker tailored for $k$-class classification problems with the performance guarantee, utilizes the full feedback, and updates parameters in a more practical and efficient manner. These careful designs lead to an instance-dependent regret upper bound, roughly improving by a multiplicative factor $O(\log T)$ and removing the curse of input dimensionality. Furthermore, we show that the algorithm can achieve the same performance as the Bayes-optimal classifier in the long run under the hard-margin setting in classification problems. In the end, we use extensive experiments to evaluate the proposed algorithm and SOTA baselines, to show the improved empirical performance.
    A Search-Based Testing Approach for Deep Reinforcement Learning Agents. (arXiv:2206.07813v2 [cs.SE] UPDATED)
    Deep Reinforcement Learning (DRL) algorithms have been increasingly employed during the last decade to solve various decision-making problems such as autonomous driving and robotics. However, these algorithms have faced great challenges when deployed in safety-critical environments since they often exhibit erroneous behaviors that can lead to potentially critical errors. One way to assess the safety of DRL agents is to test them to detect possible faults leading to critical failures during their execution. This raises the question of how we can efficiently test DRL policies to ensure their correctness and adherence to safety requirements. Most existing works on testing DRL agents use adversarial attacks that perturb states or actions of the agent. However, such attacks often lead to unrealistic states of the environment. Their main goal is to test the robustness of DRL agents rather than testing the compliance of agents' policies with respect to requirements. Due to the huge state space of DRL environments, the high cost of test execution, and the black-box nature of DRL algorithms, the exhaustive testing of DRL agents is impossible. In this paper, we propose a Search-based Testing Approach of Reinforcement Learning Agents (STARLA) to test the policy of a DRL agent by effectively searching for failing executions of the agent within a limited testing budget. We use machine learning models and a dedicated genetic algorithm to narrow the search towards faulty episodes. We apply STARLA on Deep-Q-Learning agents which are widely used as benchmarks and show that it significantly outperforms Random Testing by detecting more faults related to the agent's policy. We also investigate how to extract rules that characterize faulty episodes of the DRL agent using our search results. Such rules can be used to understand the conditions under which the agent fails and thus assess its deployment risks.
    Long Range Graph Benchmark. (arXiv:2206.08164v3 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) that are based on the message passing (MP) paradigm generally exchange information between 1-hop neighbors to build node representations at each layer. In principle, such networks are not able to capture long-range interactions (LRI) that may be desired or necessary for learning a given task on graphs. Recently, there has been an increasing interest in development of Transformer-based methods for graphs that can consider full node connectivity beyond the original sparse structure, thus enabling the modeling of LRI. However, MP-GNNs that simply rely on 1-hop message passing often fare better in several existing graph benchmarks when combined with positional feature representations, among other innovations, hence limiting the perceived utility and ranking of Transformer-like architectures. Here, we present the Long Range Graph Benchmark (LRGB) with 5 graph learning datasets: PascalVOC-SP, COCO-SP, PCQM-Contact, Peptides-func and Peptides-struct that arguably require LRI reasoning to achieve strong performance in a given task. We benchmark both baseline GNNs and Graph Transformer networks to verify that the models which capture long-range dependencies perform significantly better on these tasks. Therefore, these datasets are suitable for benchmarking and exploration of MP-GNNs and Graph Transformer architectures that are intended to capture LRI.
    Recurrent Convolutional Neural Networks Learn Succinct Learning Algorithms. (arXiv:2209.00735v2 [cs.LG] UPDATED)
    Neural networks (NNs) struggle to efficiently solve certain problems, such as learning parities, even when there are simple learning algorithms for those problems. Can NNs discover learning algorithms on their own? We exhibit a NN architecture that, in polynomial time, learns as well as any efficient learning algorithm describable by a constant-sized program. For example, on parity problems, the NN learns as well as Gaussian elimination, an efficient algorithm that can be succinctly described. Our architecture combines both recurrent weight sharing between layers and convolutional weight sharing to reduce the number of parameters down to a constant, even though the network itself may have trillions of nodes. While in practice the constants in our analysis are too large to be directly meaningful, our work suggests that the synergy of Recurrent and Convolutional NNs (RCNNs) may be more natural and powerful than either alone, particularly for concisely parameterizing discrete algorithms.
    RenyiCL: Contrastive Representation Learning with Skew Renyi Divergence. (arXiv:2208.06270v2 [stat.ML] UPDATED)
    Contrastive representation learning seeks to acquire useful representations by estimating the shared information between multiple views of data. Here, the choice of data augmentation is sensitive to the quality of learned representations: as harder the data augmentations are applied, the views share more task-relevant information, but also task-irrelevant one that can hinder the generalization capability of representation. Motivated by this, we present a new robust contrastive learning scheme, coined R\'enyiCL, which can effectively manage harder augmentations by utilizing R\'enyi divergence. Our method is built upon the variational lower bound of R\'enyi divergence, but a na\"ive usage of a variational method is impractical due to the large variance. To tackle this challenge, we propose a novel contrastive objective that conducts variational estimation of a skew R\'enyi divergence and provide a theoretical guarantee on how variational estimation of skew divergence leads to stable training. We show that R\'enyi contrastive learning objectives perform innate hard negative sampling and easy positive sampling simultaneously so that it can selectively learn useful features and ignore nuisance features. Through experiments on ImageNet, we show that R\'enyi contrastive learning with stronger augmentations outperforms other self-supervised methods without extra regularization or computational overhead. Moreover, we also validate our method on other domains such as graph and tabular, showing empirical gain over other contrastive methods.
    The Missing Invariance Principle Found -- the Reciprocal Twin of Invariant Risk Minimization. (arXiv:2205.14546v2 [cs.LG] UPDATED)
    Machine learning models often generalize poorly to out-of-distribution (OOD) data as a result of relying on features that are spuriously correlated with the label during training. Recently, the technique of Invariant Risk Minimization (IRM) was proposed to learn predictors that only use invariant features by conserving the feature-conditioned label expectation $\mathbb{E}_e[y|f(x)]$ across environments. However, more recent studies have demonstrated that IRM-v1, a practical version of IRM, can fail in various settings. Here, we identify a fundamental flaw of IRM formulation that causes the failure. We then introduce a complementary notion of invariance, MRI, based on conserving the label-conditioned feature expectation $\mathbb{E}_e[f(x)|y]$, which is free of this flaw. Further, we introduce a simplified, practical version of the MRI formulation called MRI-v1. We prove that for general linear problems, MRI-v1 guarantees invariant predictors given sufficient number of environments. We also empirically demonstrate that MRI-v1 strongly out-performs IRM-v1 and consistently achieves near-optimal OOD generalization in image-based nonlinear problems.
    Counterfactual Supervision-based Information Bottleneck for Out-of-Distribution Generalization. (arXiv:2208.07798v3 [cs.LG] UPDATED)
    Learning invariant (causal) features for out-of-distribution (OOD) generalization has attracted extensive attention recently, and among the proposals invariant risk minimization (IRM) is a notable solution. In spite of its theoretical promise for linear regression, the challenges of using IRM in linear classification problems remain. By introducing the information bottleneck (IB) principle into the learning of IRM, IB-IRM approach has demonstrated its power to solve these challenges. In this paper, we further improve IB-IRM from two aspects. First, we show that the key assumption of support overlap of invariant features used in IB-IRM is strong for the guarantee of OOD generalization and it is still possible to achieve the optimal solution without this assumption. Second, we illustrate two failure modes that IB-IRM (and IRM) could fail for learning the invariant features, and to address such failures, we propose a \textit{Counterfactual Supervision-based Information Bottleneck (CSIB)} learning algorithm that provably recovers the invariant features. By requiring counterfactual inference, CSIB works even when accessing data from a single environment. Empirical experiments on several datasets verify our theoretical results.
    Incrementality Bidding via Reinforcement Learning under Mixed and Delayed Rewards. (arXiv:2206.01293v2 [cs.LG] UPDATED)
    Incrementality, which is used to measure the causal effect of showing an ad to a potential customer (e.g. a user in an internet platform) versus not, is a central object for advertisers in online advertising platforms. This paper investigates the problem of how an advertiser can learn to optimize the bidding sequence in an online manner \emph{without} knowing the incrementality parameters in advance. We formulate the offline version of this problem as a specially structured episodic Markov Decision Process (MDP) and then, for its online learning counterpart, propose a novel reinforcement learning (RL) algorithm with regret at most $\widetilde{O}(H^2\sqrt{T})$, which depends on the number of rounds $H$ and number of episodes $T$, but does not depend on the number of actions (i.e., possible bids). A fundamental difference between our learning problem from standard RL problems is that the realized reward feedback from conversion incrementality is \emph{mixed} and \emph{delayed}. To handle this difficulty we propose and analyze a novel pairwise moment-matching algorithm to learn the conversion incrementality, which we believe is of independent of interest.
    SMPL: Simulated Industrial Manufacturing and Process Control Learning Environments. (arXiv:2206.08851v2 [cs.LG] UPDATED)
    Traditional biological and pharmaceutical manufacturing plants are controlled by human workers or pre-defined thresholds. Modernized factories have advanced process control algorithms such as model predictive control (MPC). However, there is little exploration of applying deep reinforcement learning to control manufacturing plants. One of the reasons is the lack of high fidelity simulations and standard APIs for benchmarking. To bridge this gap, we develop an easy-to-use library that includes five high-fidelity simulation environments: BeerFMTEnv, ReactorEnv, AtropineEnv, PenSimEnv and mAbEnv, which cover a wide range of manufacturing processes. We build these environments on published dynamics models. Furthermore, we benchmark online and offline, model-based and model-free reinforcement learning algorithms for comparisons of follow-up research.
    Recipe for a General, Powerful, Scalable Graph Transformer. (arXiv:2205.12454v4 [cs.LG] UPDATED)
    We propose a recipe on how to build a general, powerful, scalable (GPS) graph Transformer with linear complexity and state-of-the-art results on a diverse set of benchmarks. Graph Transformers (GTs) have gained popularity in the field of graph representation learning with a variety of recent publications but they lack a common foundation about what constitutes a good positional or structural encoding, and what differentiates them. In this paper, we summarize the different types of encodings with a clearer definition and categorize them as being $\textit{local}$, $\textit{global}$ or $\textit{relative}$. The prior GTs are constrained to small graphs with a few hundred nodes, here we propose the first architecture with a complexity linear in the number of nodes and edges $O(N+E)$ by decoupling the local real-edge aggregation from the fully-connected Transformer. We argue that this decoupling does not negatively affect the expressivity, with our architecture being a universal function approximator on graphs. Our GPS recipe consists of choosing 3 main ingredients: (i) positional/structural encoding, (ii) local message-passing mechanism, and (iii) global attention mechanism. We provide a modular framework $\textit{GraphGPS}$ that supports multiple types of encodings and that provides efficiency and scalability both in small and large graphs. We test our architecture on 16 benchmarks and show highly competitive results in all of them, show-casing the empirical benefits gained by the modularity and the combination of different strategies.
    Investigation of a Machine learning methodology for the SKA pulsar search pipeline. (arXiv:2209.04430v3 [astro-ph.IM] UPDATED)
    The SKA pulsar search pipeline will be used for real time detection of pulsars. Modern radio telescopes such as SKA will be generating petabytes of data in their full scale of operation. Hence experience-based and data-driven algorithms become indispensable for applications such as candidate detection. Here we describe our findings from testing a state of the art object detection algorithm called Mask R-CNN to detect candidate signatures in the SKA pulsar search pipeline. We have trained the Mask R-CNN model to detect candidate images. A custom annotation tool was developed to mark the regions of interest in large datasets efficiently. We have successfully demonstrated this algorithm by detecting candidate signatures on a simulation dataset. The paper presents details of this work with a highlight on the future prospects.
    Tighter Regret Analysis and Optimization of Online Federated Learning. (arXiv:2205.06491v3 [cs.LG] UPDATED)
    In federated learning (FL), it is commonly assumed that all data are placed at clients in the beginning of machine learning (ML) optimization (i.e., offline learning). However, in many real-world applications, it is expected to proceed in an online fashion. To this end, online FL (OFL) has been introduced, which aims at learning a sequence of global models from decentralized streaming data such that the so-called cumulative regret is minimized. Combining online gradient descent and model averaging, in this framework, FedOGD is constructed as the counterpart of FedSGD in FL. While it can enjoy an optimal sublinear regret, FedOGD suffers from heavy communication costs. In this paper, we present a communication-efficient method (named OFedIQ) by means of intermittent transmission (enabled by client subsampling and periodic transmission) and quantization. For the first time, we derive the regret bound that captures the impact of data-heterogeneity and the communication-efficient techniques. Through this, we efficiently optimize the parameters of OFedIQ such as sampling rate, transmission period, and quantization levels. Also, it is proved that the optimized OFedIQ can asymptotically achieve the performance of FedOGD while reducing the communication costs by 99%. Via experiments with real datasets, we demonstrate the effectiveness of the optimized OFedIQ.
    Split-kl and PAC-Bayes-split-kl Inequalities for Ternary Random Variables. (arXiv:2206.00706v2 [stat.ML] UPDATED)
    We present a new concentration of measure inequality for sums of independent bounded random variables, which we name a split-kl inequality. The inequality is particularly well-suited for ternary random variables, which naturally show up in a variety of problems, including analysis of excess losses in classification, analysis of weighted majority votes, and learning with abstention. We demonstrate that for ternary random variables the inequality is simultaneously competitive with the kl inequality, the Empirical Bernstein inequality, and the Unexpected Bernstein inequality, and in certain regimes outperforms all of them. It resolves an open question by Tolstikhin and Seldin [2013] and Mhammedi et al. [2019] on how to match simultaneously the combinatorial power of the kl inequality when the distribution happens to be close to binary and the power of Bernstein inequalities to exploit low variance when the probability mass is concentrated on the middle value. We also derive a PAC-Bayes-split-kl inequality and compare it with the PAC-Bayes-kl, PAC-Bayes-Empirical-Bennett, and PAC-Bayes-Unexpected-Bernstein inequalities in an analysis of excess losses and in an analysis of a weighted majority vote for several UCI datasets. Last but not least, our study provides the first direct comparison of the Empirical Bernstein and Unexpected Bernstein inequalities and their PAC-Bayes extensions.
    Bugs in Machine Learning-based Systems: A Faultload Benchmark. (arXiv:2206.12311v2 [cs.SE] UPDATED)
    The rapid escalation of applying Machine Learning (ML) in various domains has led to paying more attention to the quality of ML components. There is then a growth of techniques and tools aiming at improving the quality of ML components and integrating them into the ML-based system safely. Although most of these tools use bugs' lifecycle, there is no standard benchmark of bugs to assess their performance, compare them and discuss their advantages and weaknesses. In this study, we firstly investigate the reproducibility and verifiability of the bugs in ML-based systems and show the most important factors in each one. Then, we explore the challenges of generating a benchmark of bugs in ML-based software systems and provide a bug benchmark namely defect4ML that satisfies all criteria of standard benchmark, i.e. relevance, reproducibility, fairness, verifiability, and usability. This faultload benchmark contains 100 bugs reported by ML developers in GitHub and Stack Overflow, using two of the most popular ML frameworks: TensorFlow and Keras. defect4ML also addresses important challenges in Software Reliability Engineering of ML-based software systems, like: 1) fast changes in frameworks, by providing various bugs for different versions of frameworks, 2) code portability, by delivering similar bugs in different ML frameworks, 3) bug reproducibility, by providing fully reproducible bugs with complete information about required dependencies and data, and 4) lack of detailed information on bugs, by presenting links to the bugs' origins. defect4ML can be of interest to ML-based systems practitioners and researchers to assess their testing tools and techniques.
    Minimax Optimal Online Imitation Learning via Replay Estimation. (arXiv:2205.15397v5 [cs.LG] UPDATED)
    Online imitation learning is the problem of how best to mimic expert demonstrations, given access to the environment or an accurate simulator. Prior work has shown that in the infinite sample regime, exact moment matching achieves value equivalence to the expert policy. However, in the finite sample regime, even if one has no optimization error, empirical variance can lead to a performance gap that scales with $H^2 / N$ for behavioral cloning and $H / \sqrt{N}$ for online moment matching, where $H$ is the horizon and $N$ is the size of the expert dataset. We introduce the technique of replay estimation to reduce this empirical variance: by repeatedly executing cached expert actions in a stochastic simulator, we compute a smoother expert visitation distribution estimate to match. In the presence of general function approximation, we prove a meta theorem reducing the performance gap of our approach to the parameter estimation error for offline classification (i.e. learning the expert policy). In the tabular setting or with linear function approximation, our meta theorem shows that the performance gap incurred by our approach achieves the optimal $\widetilde{O} \left( \min({H^{3/2}} / {N}, {H} / {\sqrt{N}} \right)$ dependency, under significantly weaker assumptions compared to prior work. We implement multiple instantiations of our approach on several continuous control tasks and find that we are able to significantly improve policy performance across a variety of dataset sizes.
    The Mechanism of Prediction Head in Non-contrastive Self-supervised Learning. (arXiv:2205.06226v3 [cs.LG] UPDATED)
    Recently the surprising discovery of the Bootstrap Your Own Latent (BYOL) method by Grill et al. shows the negative term in contrastive loss can be removed if we add the so-called prediction head to the network. This initiated the research of non-contrastive self-supervised learning. It is mysterious why even when there exist trivial collapsed global optimal solutions, neural networks trained by (stochastic) gradient descent can still learn competitive representations. This phenomenon is a typical example of implicit bias in deep learning and remains little understood. In this work, we present our empirical and theoretical discoveries on non-contrastive self-supervised learning. Empirically, we find that when the prediction head is initialized as an identity matrix with only its off-diagonal entries being trainable, the network can learn competitive representations even though the trivial optima still exist in the training objective. Theoretically, we present a framework to understand the behavior of the trainable, but identity-initialized prediction head. Under a simple setting, we characterized the substitution effect and acceleration effect of the prediction head. The substitution effect happens when learning the stronger features in some neurons can substitute for learning these features in other neurons through updating the prediction head. And the acceleration effect happens when the substituted features can accelerate the learning of other weaker features to prevent them from being ignored. These two effects enable the neural networks to learn all the features rather than focus only on learning the stronger features, which is likely the cause of the dimensional collapse phenomenon. To the best of our knowledge, this is also the first end-to-end optimization guarantee for non-contrastive methods using nonlinear neural networks with a trainable prediction head and normalization.
    TaSIL: Taylor Series Imitation Learning. (arXiv:2205.14812v2 [cs.LG] UPDATED)
    We propose Taylor Series Imitation Learning (TaSIL), a simple augmentation to standard behavior cloning losses in the context of continuous control. TaSIL penalizes deviations in the higher-order Taylor series terms between the learned and expert policies. We show that experts satisfying a notion of $\textit{incremental input-to-state stability}$ are easy to learn, in the sense that a small TaSIL-augmented imitation loss over expert trajectories guarantees a small imitation loss over trajectories generated by the learned policy. We provide sample-complexity bounds for TaSIL that scale as $\tilde{\mathcal{O}}(1/n)$ in the realizable setting, for $n$ the number of expert demonstrations. Finally, we demonstrate experimentally the relationship between the robustness of the expert policy and the order of Taylor expansion required in TaSIL, and compare standard Behavior Cloning, DART, and DAgger with TaSIL-loss-augmented variants. In all cases, we show significant improvement over baselines across a variety of MuJoCo tasks.
    Bayesian Active Learning with Fully Bayesian Gaussian Processes. (arXiv:2205.10186v3 [cs.LG] UPDATED)
    The bias-variance trade-off is a well-known problem in machine learning that only gets more pronounced the less available data there is. In active learning, where labeled data is scarce or difficult to obtain, neglecting this trade-off can cause inefficient and non-optimal querying, leading to unnecessary data labeling. In this paper, we focus on active learning with Gaussian Processes (GPs). For the GP, the bias-variance trade-off is made by optimization of the two hyperparameters: the length scale and noise-term. Considering that the optimal mode of the joint posterior of the hyperparameters is equivalent to the optimal bias-variance trade-off, we approximate this joint posterior and utilize it to design two new acquisition functions. The first one is a Bayesian variant of Query-by-Committee (B-QBC), and the second is an extension that explicitly minimizes the predictive variance through a Query by Mixture of Gaussian Processes (QB-MGP) formulation. Across six simulators, we empirically show that B-QBC, on average, achieves the best marginal likelihood, whereas QB-MGP achieves the best predictive performance. We show that incorporating the bias-variance trade-off in the acquisition functions mitigates unnecessary and expensive data labeling.
    Weisfeiler and Leman Go Walking: Random Walk Kernels Revisited. (arXiv:2205.10914v3 [cs.LG] UPDATED)
    Random walk kernels have been introduced in seminal work on graph learning and were later largely superseded by kernels based on the Weisfeiler-Leman test for graph isomorphism. We give a unified view on both classes of graph kernels. We study walk-based node refinement methods and formally relate them to several widely-used techniques, including Morgan's algorithm for molecule canonization and the Weisfeiler-Leman test. We define corresponding walk-based kernels on nodes that allow fine-grained parameterized neighborhood comparison, reach Weisfeiler-Leman expressiveness, and are computed using the kernel trick. From this we show that classical random walk kernels with only minor modifications regarding definition and computation are as expressive as the widely-used Weisfeiler-Leman subtree kernel but support non-strict neighborhood comparison. We verify experimentally that walk-based kernels reach or even surpass the accuracy of Weisfeiler-Leman kernels in real-world classification tasks.
    Joint Entropy Search for Maximally-Informed Bayesian Optimization. (arXiv:2206.04771v5 [cs.LG] UPDATED)
    Information-theoretic Bayesian optimization techniques have become popular for optimizing expensive-to-evaluate black-box functions due to their non-myopic qualities. Entropy Search and Predictive Entropy Search both consider the entropy over the optimum in the input space, while the recent Max-value Entropy Search considers the entropy over the optimal value in the output space. We propose Joint Entropy Search (JES), a novel information-theoretic acquisition function that considers an entirely new quantity, namely the entropy over the joint optimal probability density over both input and output space. To incorporate this information, we consider the reduction in entropy from conditioning on fantasized optimal input/output pairs. The resulting approach primarily relies on standard GP machinery and removes complex approximations typically associated with information-theoretic methods. With minimal computational overhead, JES shows superior decision-making, and yields state-of-the-art performance for information-theoretic approaches across a wide suite of tasks. As a light-weight approach with superior results, JES provides a new go-to acquisition function for Bayesian optimization.
    Neural Network Architecture Beyond Width and Depth. (arXiv:2205.09459v4 [cs.LG] UPDATED)
    This paper proposes a new neural network architecture by introducing an additional dimension called height beyond width and depth. Neural network architectures with height, width, and depth as hyper-parameters are called three-dimensional architectures. It is shown that neural networks with three-dimensional architectures are significantly more expressive than the ones with two-dimensional architectures (those with only width and depth as hyper-parameters), e.g., standard fully connected networks. The new network architecture is constructed recursively via a nested structure, and hence we call a network with the new architecture nested network (NestNet). A NestNet of height $s$ is built with each hidden neuron activated by a NestNet of height $\le s-1$. When $s=1$, a NestNet degenerates to a standard network with a two-dimensional architecture. It is proved by construction that height-$s$ ReLU NestNets with $\mathcal{O}(n)$ parameters can approximate $1$-Lipschitz continuous functions on $[0,1]^d$ with an error $\mathcal{O}(n^{-(s+1)/d})$, while the optimal approximation error of standard ReLU networks with $\mathcal{O}(n)$ parameters is $\mathcal{O}(n^{-2/d})$. Furthermore, such a result is extended to generic continuous functions on $[0,1]^d$ with the approximation error characterized by the modulus of continuity. Finally, we use numerical experimentation to show the advantages of the super-approximation power of ReLU NestNets.
    TransBoost: Improving the Best ImageNet Performance using Deep Transduction. (arXiv:2205.13331v4 [cs.CV] UPDATED)
    This paper deals with deep transductive learning, and proposes TransBoost as a procedure for fine-tuning any deep neural model to improve its performance on any (unlabeled) test set provided at training time. TransBoost is inspired by a large margin principle and is efficient and simple to use. Our method significantly improves the ImageNet classification performance on a wide range of architectures, such as ResNets, MobileNetV3-L, EfficientNetB0, ViT-S, and ConvNext-T, leading to state-of-the-art transductive performance. Additionally we show that TransBoost is effective on a wide variety of image classification datasets. The implementation of TransBoost is provided at: https://github.com/omerb01/TransBoost .
    Does Self-supervised Learning Really Improve Reinforcement Learning from Pixels?. (arXiv:2206.05266v4 [cs.LG] UPDATED)
    We investigate whether self-supervised learning (SSL) can improve online reinforcement learning (RL) from pixels. We extend the contrastive reinforcement learning framework (e.g., CURL) that jointly optimizes SSL and RL losses and conduct an extensive amount of experiments with various self-supervised losses. Our observations suggest that the existing SSL framework for RL fails to bring meaningful improvement over the baselines only taking advantage of image augmentation when the same amount of data and augmentation is used. We further perform evolutionary searches to find the optimal combination of multiple self-supervised losses for RL, but find that even such a loss combination fails to meaningfully outperform the methods that only utilize carefully designed image augmentations. After evaluating these approaches together in multiple different environments including a real-world robot environment, we confirm that no single self-supervised loss or image augmentation method can dominate all environments and that the current framework for joint optimization of SSL and RL is limited. Finally, we conduct the ablation study on multiple factors and demonstrate the properties of representations learned with different approaches.
    Fast Bayesian Coresets via Subsampling and Quasi-Newton Refinement. (arXiv:2203.09675v3 [stat.ML] UPDATED)
    Bayesian coresets approximate a posterior distribution by building a small weighted subset of the data points. Any inference procedure that is too computationally expensive to be run on the full posterior can instead be run inexpensively on the coreset, with results that approximate those on the full data. However, current approaches are limited by either a significant run-time or the need for the user to specify a low-cost approximation to the full posterior. We propose a Bayesian coreset construction algorithm that first selects a uniformly random subset of data, and then optimizes the weights using a novel quasi-Newton method. Our algorithm is a simple to implement, black-box method, that does not require the user to specify a low-cost posterior approximation. It is the first to come with a general high-probability bound on the KL divergence of the output coreset posterior. Experiments demonstrate that our method provides significant improvements in coreset quality against alternatives with comparable construction times, with far less storage cost and user input required.
    Curriculum Learning for Goal-Oriented Semantic Communications with a Common Language. (arXiv:2204.10429v2 [cs.NI] UPDATED)
    Goal-oriented semantic communication will be a pillar of next-generation wireless networks. Despite significant recent efforts in this area, most prior works are focused on specific data types (e.g., image or audio), and they ignore the goal and effectiveness aspects of semantic transmissions. In contrast, in this paper, a holistic goal-oriented semantic communication framework is proposed to enable a speaker and a listener to cooperatively execute a set of sequential tasks in a dynamic environment. A common language based on a hierarchical belief set is proposed to enable semantic communications between speaker and listener. The speaker, acting as an observer of the environment, utilizes the beliefs to transmit an initial description of its observation (called event) to the listener. The listener is then able to infer on the transmitted description and complete it by adding related beliefs to the transmitted beliefs of the speaker. As such, the listener reconstructs the observed event based on the completed description, and it then takes appropriate action in the environment based on the reconstructed event. An optimization problem is defined to determine the perfect and abstract description of the events while minimizing the transmission and inference costs with constraints on the task execution time and belief efficiency. Then, a novel bottom-up curriculum learning (CL) framework based on reinforcement learning is proposed to solve the optimization problem and enable the speaker and listener to gradually identify the structure of the belief set and the perfect and abstract description of the events. Simulation results show that the proposed CL method outperforms traditional RL in terms of convergence time, task execution cost and time, reliability, and belief efficiency.
    Learning Neural Acoustic Fields. (arXiv:2204.00628v2 [cs.SD] UPDATED)
    Our environment is filled with rich and dynamic acoustic information. When we walk into a cathedral, the reverberations as much as appearance inform us of the sanctuary's wide open space. Similarly, as an object moves around us, we expect the sound emitted to also exhibit this movement. While recent advances in learned implicit functions have led to increasingly higher quality representations of the visual world, there have not been commensurate advances in learning spatial auditory representations. To address this gap, we introduce Neural Acoustic Fields (NAFs), an implicit representation that captures how sounds propagate in a physical scene. By modeling acoustic propagation in a scene as a linear time-invariant system, NAFs learn to continuously map all emitter and listener location pairs to a neural impulse response function that can then be applied to arbitrary sounds. We demonstrate that the continuous nature of NAFs enables us to render spatial acoustics for a listener at an arbitrary location, and can predict sound propagation at novel locations. We further show that the representation learned by NAFs can help improve visual learning with sparse views. Finally, we show that a representation informative of scene structure emerges during the learning of NAFs.
    Multi-sensor large-scale dataset for multi-view 3D reconstruction. (arXiv:2203.06111v2 [cs.CV] UPDATED)
    We present a new multi-sensor dataset for multi-view 3D surface reconstruction. It includes registered RGB and depth data from sensors of different resolutions and modalities: smartphones, Intel RealSense, Microsoft Kinect, industrial cameras, and structured-light scanner. The data for each scene is obtained under a large number of lighting conditions, and the scenes are selected to emphasize a diverse set of material properties challenging for existing algorithms. Overall, we provide around 1.4 million images of 107 different scenes acquired at 14 lighting conditions from 100 viewing directions. We expect our dataset will be useful for evaluation and training of 3D reconstruction algorithms of different types and for other related tasks.
    Off-Policy Evaluation with Online Adaptation for Robot Exploration in Challenging Environments. (arXiv:2204.03140v2 [cs.RO] UPDATED)
    Autonomous exploration has many important applications. However, classic information gain-based or frontier-based exploration only relies on the robot current state to determine the immediate exploration goal, which lacks the capability of predicting the value of future states and thus leads to inefficient exploration decisions. This paper presents a method to learn how "good" states are, measured by the state value function, to provide a guidance for robot exploration in real-world challenging environments. We formulate our work as a off-policy evaluation (OPE) problem for robot exploration (OPERE). It consists of offline Monte-Carlo training on real-world data and performs Temporal Difference (TD) online adaptation to optimize the trained value estimator. We also design an intrinsic reward function based on sensor information coverage to enable the robot to gain more information with sparse extrinsic rewards. Results demonstrate that our method enables the robot to predict the value of future states so as to better guide robot exploration. The proposed algorithm achieves better prediction performance compared with other state-of-the-art OPE methods. To the best of our knowledge, this work for the first time demonstrates value function prediction on real-world dataset for robot exploration in challenging subterranean and urban environments. More details and demo videos can be found at https://jeffreyyh.github.io/opere/.
    coVariance Neural Networks. (arXiv:2205.15856v4 [cs.LG] UPDATED)
    Graph neural networks (GNN) are an effective framework that exploit inter-relationships within graph-structured data for learning. Principal component analysis (PCA) involves the projection of data on the eigenspace of the covariance matrix and draws similarities with the graph convolutional filters in GNNs. Motivated by this observation, we study a GNN architecture, called coVariance neural network (VNN), that operates on sample covariance matrices as graphs. We theoretically establish the stability of VNNs to perturbations in the covariance matrix, thus, implying an advantage over standard PCA-based data analysis approaches that are prone to instability due to principal components associated with close eigenvalues. Our experiments on real-world datasets validate our theoretical results and show that VNN performance is indeed more stable than PCA-based statistical approaches. Moreover, our experiments on multi-resolution datasets also demonstrate that VNNs are amenable to transferability of performance over covariance matrices of different dimensions; a feature that is infeasible for PCA-based approaches.
    Recipes for when Physics Fails: Recovering Robust Learning of Physics Informed Neural Networks. (arXiv:2110.13330v2 [cs.LG] UPDATED)
    Physics-informed Neural Networks (PINNs) have been shown to be effective in solving partial differential equations by capturing the physics induced constraints as a part of the training loss function. This paper shows that a PINN can be sensitive to errors in training data and overfit itself in dynamically propagating these errors over the domain of the solution of the PDE. It also shows how physical regularizations based on continuity criteria and conservation laws fail to address this issue and rather introduce problems of their own causing the deep network to converge to a physics-obeying local minimum instead of the global minimum. We introduce Gaussian Process (GP) based smoothing that recovers the performance of a PINN and promises a robust architecture against noise/errors in measurements. Additionally, we illustrate an inexpensive method of quantifying the evolution of uncertainty based on the variance estimation of GPs on boundary data. Robust PINN performance is also shown to be achievable by choice of sparse sets of inducing points based on sparsely induced GPs. We demonstrate the performance of our proposed methods and compare the results from existing benchmark models in literature for time-dependent Schr\"odinger and Burgers' equations.
    Adaptive Composite Online Optimization: Predictions in Static and Dynamic Environments. (arXiv:2205.00446v2 [math.OC] UPDATED)
    In the past few years, Online Convex Optimization (OCO) has received notable attention in the control literature thanks to its flexible real-time nature and powerful performance guarantees. In this paper, we propose new step-size rules and OCO algorithms that simultaneously exploit gradient predictions, function predictions and dynamics, features particularly pertinent to control applications. The proposed algorithms enjoy static and dynamic regret bounds in terms of the dynamics of the reference action sequence, gradient prediction error, and function prediction error, which are generalizations of known regularity measures from the literature. We present results for both convex and strongly convex costs. We validate the performance of the proposed algorithms in a trajectory tracking case study, as well as portfolio optimization using real-world datasets.
    Toward Explainable AI for Regression Models. (arXiv:2112.11407v2 [cs.LG] UPDATED)
    In addition to the impressive predictive power of machine learning (ML) models, more recently, explanation methods have emerged that enable an interpretation of complex non-linear learning models such as deep neural networks. Gaining a better understanding is especially important e.g. for safety-critical ML applications or medical diagnostics etc. While such Explainable AI (XAI) techniques have reached significant popularity for classifiers, so far little attention has been devoted to XAI for regression models (XAIR). In this review, we clarify the fundamental conceptual differences of XAI for regression and classification tasks, establish novel theoretical insights and analysis for XAIR, provide demonstrations of XAIR on genuine practical regression problems, and finally discuss the challenges remaining for the field.
    Generalization Error Bounds for Multiclass Sparse Linear Classifiers. (arXiv:2204.06264v2 [math.ST] UPDATED)
    We consider high-dimensional multiclass classification by sparse multinomial logistic regression. Unlike binary classification, in the multiclass setup one can think about an entire spectrum of possible notions of sparsity associated with different structural assumptions on the regression coefficients matrix. We propose a computationally feasible feature selection procedure based on penalized maximum likelihood with convex penalties capturing a specific type of sparsity at hand. In particular, we consider global sparsity, double row-wise sparsity, and low-rank sparsity, and show that with the properly chosen tuning parameters the derived plug-in classifiers attain the minimax generalization error bounds (in terms of misclassification excess risk) within the corresponding classes of multiclass sparse linear classifiers. The developed approach is general and can be adapted to other types of sparsity as well.
    Smoothed Online Combinatorial Optimization Using Imperfect Predictions. (arXiv:2204.10979v2 [cs.LG] UPDATED)
    Smoothed online combinatorial optimization considers a learner who repeatedly chooses a combinatorial decision to minimize an unknown changing cost function with a penalty on switching decisions in consecutive rounds. We study smoothed online combinatorial optimization problems when an imperfect predictive model is available, where the model can forecast the future cost functions with uncertainty. We show that using predictions to plan for a finite time horizon leads to regret dependent on the total predictive uncertainty and an additional switching cost. This observation suggests choosing a suitable planning window to balance between uncertainty and switching cost, which leads to an online algorithm with guarantees on the upper and lower bounds of the cumulative regret. Empirically, our algorithm shows a significant improvement in cumulative regret compared to other baselines in synthetic online distributed streaming problems.
    3D-C2FT: Coarse-to-fine Transformer for Multi-view 3D Reconstruction. (arXiv:2205.14575v2 [cs.CV] UPDATED)
    Recently, the transformer model has been successfully employed for the multi-view 3D reconstruction problem. However, challenges remain on designing an attention mechanism to explore the multiview features and exploit their relations for reinforcing the encoding-decoding modules. This paper proposes a new model, namely 3D coarse-to-fine transformer (3D-C2FT), by introducing a novel coarse-to-fine(C2F) attention mechanism for encoding multi-view features and rectifying defective 3D objects. C2F attention mechanism enables the model to learn multi-view information flow and synthesize 3D surface correction in a coarse to fine-grained manner. The proposed model is evaluated by ShapeNet and Multi-view Real-life datasets. Experimental results show that 3D-C2FT achieves notable results and outperforms several competing models on these datasets.
    Compositional Visual Generation with Composable Diffusion Models. (arXiv:2206.01714v6 [cs.CV] UPDATED)
    Large text-guided diffusion models, such as DALLE-2, are able to generate stunning photorealistic images given natural language descriptions. While such models are highly flexible, they struggle to understand the composition of certain concepts, such as confusing the attributes of different objects or relations between objects. In this paper, we propose an alternative structured approach for compositional generation using diffusion models. An image is generated by composing a set of diffusion models, with each of them modeling a certain component of the image. To do this, we interpret diffusion models as energy-based models in which the data distributions defined by the energy functions may be explicitly combined. The proposed method can generate scenes at test time that are substantially more complex than those seen in training, composing sentence descriptions, object relations, human facial attributes, and even generalizing to new combinations that are rarely seen in the real world. We further illustrate how our approach may be used to compose pre-trained text-guided diffusion models and generate photorealistic images containing all the details described in the input descriptions, including the binding of certain object attributes that have been shown difficult for DALLE-2. These results point to the effectiveness of the proposed method in promoting structured generalization for visual generation. Project page: https://energy-based-model.github.io/Compositional-Visual-Generation-with-Composable-Diffusion-Models/
    Scalable Decision-Focused Learning in Restless Multi-Armed Bandits with Application to Maternal and Child Health. (arXiv:2202.00916v3 [cs.LG] UPDATED)
    This paper studies restless multi-armed bandit (RMAB) problems with unknown arm transition dynamics but with known correlated arm features. The goal is to learn a model to predict transition dynamics given features, where the Whittle index policy solves the RMAB problems using predicted transitions. However, prior works often learn the model by maximizing the predictive accuracy instead of final RMAB solution quality, causing a mismatch between training and evaluation objectives. To address this shortcoming, we propose a novel approach for decision-focused learning in RMAB that directly trains the predictive model to maximize the Whittle index solution quality. We present three key contributions: (i) we establish differentiability of the Whittle index policy to support decision-focused learning; (ii) we significantly improve the scalability of decision-focused learning approaches in sequential problems, specifically RMAB problems; (iii) we apply our algorithm to a previously collected dataset of maternal and child health to demonstrate its performance. Indeed, our algorithm is the first for decision-focused learning in RMAB that scales to real-world problem sizes.
    Handling Bias in Toxic Speech Detection: A Survey. (arXiv:2202.00126v3 [cs.SI] UPDATED)
    Detecting online toxicity has always been a challenge due to its inherent subjectivity. Factors such as the context, geography, socio-political climate, and background of the producers and consumers of the posts play a crucial role in determining if the content can be flagged as toxic. Adoption of automated toxicity detection models in production can thus lead to a sidelining of the various groups they aim to help in the first place. It has piqued researchers' interest in examining unintended biases and their mitigation. Due to the nascent and multi-faceted nature of the work, complete literature is chaotic in its terminologies, techniques, and findings. In this paper, we put together a systematic study of the limitations and challenges of existing methods for mitigating bias in toxicity detection. We look closely at proposed methods for evaluating and mitigating bias in toxic speech detection. To examine the limitations of existing methods, we also conduct a case study to introduce the concept of bias shift due to knowledge-based bias mitigation. The survey concludes with an overview of the critical challenges, research gaps, and future directions. While reducing toxicity on online platforms continues to be an active area of research, a systematic study of various biases and their mitigation strategies will help the research community produce robust and fair models.
    A Ranking Game for Imitation Learning. (arXiv:2202.03481v3 [cs.LG] UPDATED)
    We propose a new framework for imitation learning -- treating imitation as a two-player ranking-based game between a policy and a reward. In this game, the reward agent learns to satisfy pairwise performance rankings between behaviors, while the policy agent learns to maximize this reward. In imitation learning, near-optimal expert data can be difficult to obtain, and even in the limit of infinite data cannot imply a total ordering over trajectories as preferences can. On the other hand, learning from preferences alone is challenging as a large number of preferences are required to infer a high-dimensional reward function, though preference data is typically much easier to collect than expert demonstrations. The classical inverse reinforcement learning (IRL) formulation learns from expert demonstrations but provides no mechanism to incorporate learning from offline preferences and vice versa. We instantiate the proposed ranking-game framework with a novel ranking loss giving an algorithm that can simultaneously learn from expert demonstrations and preferences, gaining the advantages of both modalities. Our experiments show that the proposed method achieves state-of-the-art sample efficiency and can solve previously unsolvable tasks in the Learning from Observation (LfO) setting. Project video and code can be found at https://hari-sikchi.github.io/rank-game/
    Dynamic Combination of Heterogeneous Models for Hierarchical Time Series. (arXiv:2112.11669v2 [cs.LG] UPDATED)
    We introduce a framework to dynamically combine heterogeneous models called \texttt{DYCHEM}, which forecasts a set of time series that are related through an aggregation hierarchy. Different types of forecasting models can be employed as individual ``experts'' so that each model is tailored to the nature of the corresponding time series. \texttt{DYCHEM} learns hierarchical structures during the training stage to help generalize better across all the time series being modeled and also mitigates coherency issues that arise due to constraints imposed by the hierarchy. To improve the reliability of forecasts, we construct quantile estimations based on the point forecasts obtained from combined heterogeneous models. The resulting quantile forecasts are coherent and independent of the choice of forecasting models. We conduct a comprehensive evaluation of both point and quantile forecasts for hierarchical time series (HTS), including public data and user records from a large financial software company. In general, our method is robust, adaptive to datasets with different properties, and highly configurable and efficient for large-scale forecasting pipelines.
    $A^{3}D$: A Platform of Searching for Robust Neural Architectures and Efficient Adversarial Attacks. (arXiv:2203.03128v2 [cs.LG] UPDATED)
    The robustness of deep neural networks (DNN) models has attracted increasing attention due to the urgent need for security in many applications. Numerous existing open-sourced tools or platforms are developed to evaluate the robustness of DNN models by ensembling the majority of adversarial attack or defense algorithms. Unfortunately, current platforms do not possess the ability to optimize the architectures of DNN models or the configuration of adversarial attacks to further enhance the robustness of models or the performance of adversarial attacks. To alleviate these problems, in this paper, we first propose a novel platform called auto adversarial attack and defense ($A^{3}D$), which can help search for robust neural network architectures and efficient adversarial attacks. In $A^{3}D$, we employ multiple neural architecture search methods, which consider different robustness evaluation metrics, including four types of noises: adversarial noise, natural noise, system noise, and quantified metrics, resulting in finding robust architectures. Besides, we propose a mathematical model for auto adversarial attack, and provide multiple optimization algorithms to search for efficient adversarial attacks. In addition, we combine auto adversarial attack and defense together to form a unified framework. Among auto adversarial defense, the searched efficient attack can be used as the new robustness evaluation to further enhance the robustness. In auto adversarial attack, the searched robust architectures can be utilized as the threat model to help find stronger adversarial attacks. Experiments on CIFAR10, CIFAR100, and ImageNet datasets demonstrate the feasibility and effectiveness of the proposed platform, which can also provide a benchmark and toolkit for researchers in the application of automated machine learning in evaluating and improving the DNN model robustnesses.
    One-Step Abductive Multi-Target Learning with Diverse Noisy Samples and Its Application to Tumour Segmentation for Breast Cancer. (arXiv:2110.10325v8 [cs.LG] UPDATED)
    Recent studies have demonstrated the effectiveness of the combination of machine learning and logical reasoning, including data-driven logical reasoning, knowledge driven machine learning and abductive learning, in inventing advanced artificial intelligence technologies. One-step abductive multi-target learning (OSAMTL), an approach inspired by abductive learning, via simply combining machine learning and logical reasoning in a one-step balanced way, has as well shown its effectiveness in handling complex noisy labels of a single noisy sample in medical histopathology whole slide image analysis (MHWSIA). However, OSAMTL is not suitable for the situation where diverse noisy samples (DiNS) are provided for a learning task. In this paper, giving definition of DiNS, we propose one-step abductive multi-target learning with DiNS (OSAMTL-DiNS) to expand the original OSAMTL to handle complex noisy labels of DiNS. Applying OSAMTL-DiNS to tumour segmentation for breast cancer in MHWSIA, we show that OSAMTL-DiNS is able to enable various state-of-the-art approaches for learning from noisy labels to achieve more rational predictions.
    Gap Minimization for Knowledge Sharing and Transfer. (arXiv:2201.11231v2 [cs.LG] UPDATED)
    Learning from multiple related tasks by knowledge sharing and transfer has become increasingly relevant over the last two decades. In order to successfully transfer information from one task to another, it is critical to understand the similarities and differences between the domains. In this paper, we introduce the notion of \emph{performance gap}, an intuitive and novel measure of the distance between learning tasks. Unlike existing measures which are used as tools to bound the difference of expected risks between tasks (e.g., $\mathcal{H}$-divergence or discrepancy distance), we theoretically show that the performance gap can be viewed as a data- and algorithm-dependent regularizer, which controls the model complexity and leads to finer guarantees. More importantly, it also provides new insights and motivates a novel principle for designing strategies for knowledge sharing and transfer: gap minimization. We instantiate this principle with two algorithms: 1. gapBoost, a novel and principled boosting algorithm that explicitly minimizes the performance gap between source and target domains for transfer learning; and 2. gapMTNN, a representation learning algorithm that reformulates gap minimization as semantic conditional matching for multitask learning. Our extensive evaluation on both transfer learning and multitask learning benchmark data sets shows that our methods outperform existing baselines.
    Forecasting Market Changes using Variational Inference. (arXiv:2205.00605v2 [q-fin.ST] UPDATED)
    Though various approaches have been considered, forecasting near-term market changes of equities and similar market data remains quite difficult. In this paper we introduce an approach to forecast near-term market changes for equity indices as well as portfolios using variational inference (VI). VI is a machine learning approach which uses optimization techniques to estimate complex probability densities. In the proposed approach, clusters of explanatory variables are identified and market changes are forecast based on cluster-specific linear regression. Apart from the expected value of changes, the proposed approach can also be used to obtain the distribution of possible outcomes. Another advantage of the proposed approach is the clear model interpretation, as clusters of explanatory variables (or market regimes) are identified for which the future changes follow similar relationships. Knowledge about such clusters can provide useful insights about portfolio performance and identify the relative importance of variables in different market regimes. An illustrative example of predicting one-day S\&P change is considered and it is shown that even with as few as three explanatory variables, the proposed approach provides useful predictions.
    Analysis of autocorrelation times in Neural Markov Chain Monte Carlo simulations. (arXiv:2111.10189v3 [cond-mat.stat-mech] UPDATED)
    We provide a deepened study of autocorrelations in Neural Markov Chain Monte Carlo (NMCMC) simulations, a version of the traditional Metropolis algorithm which employs neural networks to provide independent proposals. We illustrate our ideas using the two-dimensional Ising model. We discuss several estimates of autocorrelation times in the context of NMCMC, some inspired by analytical results derived for the Metropolized Independent Sampler (MIS). We check their reliability by estimating them on a small system where analytical results can also be obtained. Based on the analytical results for MIS we propose a new loss function and study its impact on the autocorelation times. Although, this function's performance is a bit inferior to the traditional Kullback-Leibler divergence, it offers two training algorithms which in some situations may be beneficial. By studying a small, $4 \times 4$, system we gain access to the dynamics of the training process which we visualize using several observables. Furthermore, we quantitatively investigate the impact of imposing global discrete symmetries of the system in the neural network training process on the autocorrelation times. Eventually, we propose a scheme which incorporates partial heat-bath updates which considerably improves the quality of the training. The impact of the above enhancements is discussed for a $16 \times 16$ spin system. The summary of our findings may serve as a guidance to the implementation of Neural Markov Chain Monte Carlo simulations for more complicated models.
    Federated Learning with Heterogeneous Differential Privacy. (arXiv:2110.15252v2 [cs.LG] UPDATED)
    Federated learning (FL) takes a first step towards privacy-preserving machine learning by training models while keeping client data local. Models trained using FL may still leak private client information through model updates during training. Differential privacy (DP) may be employed on model updates to provide privacy guarantees within FL, typically at the cost of degraded performance of the final trained model. Both non-private FL and DP-FL can be solved using variants of the federated averaging (FedAvg) algorithm. In this work, we consider a heterogeneous DP setup where clients require varying degrees of privacy guarantees. First, we analyze the optimal solution to the federated linear regression problem with heterogeneous DP in a Bayesian setup. We find that unlike the non-private setup, where the optimal solution for homogeneous data amounts to a single global solution for all clients learned through FedAvg, the optimal solution for each client in this setup would be a personalized one even for homogeneous data. We also analyze the privacy-utility trade-off for this setup, where we characterize the gain obtained from heterogeneous privacy where some clients opt for less strict privacy guarantees. We propose a new algorithm for FL with heterogeneous DP, named FedHDP, which employs personalization and weighted averaging at the server using the privacy choices of clients, to achieve better performance on clients' local models. Through numerical experiments, we show that FedHDP provides up to $9.27\%$ performance gain compared to the baseline DP-FL for the considered datasets where $5\%$ of clients opt out of DP. Additionally, we show a gap in the average performance of local models between non-private and private clients of up to $3.49\%$, empirically illustrating that the baseline DP-FL might incur a large utility cost when not all clients require the stricter privacy guarantees.
    MANDERA: Malicious Node Detection in Federated Learning via Ranking. (arXiv:2110.11736v2 [cs.LG] UPDATED)
    Byzantine attacks hinder the deployment of federated learning algorithms. Although we know that the benign gradients and Byzantine attacked gradients are distributed differently, to detect the malicious gradients is challenging due to (1) the gradient is high-dimensional and each dimension has its unique distribution and (2) the benign gradients and the attacked gradients are always mixed (two-sample test methods cannot apply directly). To address the above, for the first time, we propose MANDERA which is theoretically guaranteed to efficiently detect all malicious gradients under Byzantine attacks with no prior knowledge or history about the number of attacked nodes. More specifically, we transfer the original updating gradient space into a ranking matrix. By such an operation, the scales of different dimensions of the gradients in the ranking space become identical. The high-dimensional benign gradients and the malicious gradients can be easily separated. The effectiveness of MANDERA is further confirmed by experimentation on four Byzantine attack implementations (Gaussian, Zero Gradient, Sign Flipping, Shifted Mean), comparing with state-of-the-art defenses. The experiments cover both IID and Non-IID datasets.
    A Deep Reinforcement Learning Approach for Online Parcel Assignment. (arXiv:2109.03467v2 [cs.LG] UPDATED)
    In this paper, we investigate the online parcel assignment (OPA) problem, in which each stochastically generated parcel needs to be assigned to a candidate route for delivery to minimize the total cost subject to certain business constraints. The OPA problem is challenging due to its stochastic nature: each parcel's candidate routes, which depends on the parcel's origin, destination, weight, etc., are unknown until its order is placed, and the total parcel volume is uncertain in advance. To tackle this challenge, we propose the PPO-OPA algorithm based on deep reinforcement learning that shows competitive performance. More specifically, we introduce a novel Markov Decision Process (MDP) framework to model the OPA problem, and develop a policy gradient algorithm that adopts attention networks for policy evaluation. By designing a dedicated reward function, our proposed algorithm can achieve a lower total cost with smaller violation of constraints, comparing to the traditional method which assigns parcels to candidate routes proportionally. In addition, the performances of our proposed algorithm and the Primal-Dual algorithm are comparable, while the later assumes a known total parcel volume in advance, which is unrealistic in practice.
    First-Order Algorithms for Nonlinear Generalized Nash Equilibrium Problems. (arXiv:2204.03132v2 [math.OC] UPDATED)
    We consider the problem of computing an equilibrium in a class of \textit{nonlinear generalized Nash equilibrium problems (NGNEPs)} in which the strategy sets for each player are defined by the equality and inequality constraints that may depend on the choices of rival players. While the asymptotic global convergence and local convergence rate of certain algorithms have been extensively investigated, the iteration complexity analysis is still in its infancy. This paper provides two first-order algorithms based on quadratic penalty method (QPM) and augmented Lagrangian method (ALM), respectively, with an accelerated mirror-prox algorithm as the solver in each inner loop. We show the nonasymptotic convergence rate for these algorithms. In particular, we establish the global convergence guarantee for solving monotone and strongly monotone NGNEPs and provide the complexity bounds expressed in terms of the number of gradient evaluations. Experimental results demonstrate the efficiency of our algorithms in practice.
    GraphTheta: A Distributed Graph Neural Network Learning System With Flexible Training Strategy. (arXiv:2104.10569v3 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have been demonstrated as a powerful tool for analyzing non-Euclidean graph data. However, the lack of efficient distributed graph learning systems severely hinders applications of GNNs, especially when graphs are big and GNNs are relatively deep. Herein, we present GraphTheta, the first distributed and scalable graph learning system built upon vertex-centric distributed graph processing with neural network operators implemented as user-defined functions. This system supports multiple training strategies and enables efficient and scalable big-graph learning on distributed (virtual) machines with low memory. To facilitate graph convolutions, GraphTheta puts forward a new graph learning abstraction named NN-TGAR to bridge the gap between graph processing and graph deep learning. A distributed graph engine is proposed to conduct the stochastic gradient descent optimization with a hybrid-parallel execution, and a new cluster-batched training strategy is supported. We evaluate GraphTheta using several datasets with network sizes ranging from small-, modest- to large-scale. Experimental results show that GraphTheta can scale well to 1,024 workers for training an in-house developed GNN on an industry-scale Alipay dataset of 1.4 billion nodes and 4.1 billion attributed edges, with a cluster of CPU virtual machines (dockers) of small memory each (5$\sim$12GB). Moreover, GraphTheta can outperform DistDGL by up to $2.02\times$, with better scalability, and GraphLearn by up to $30.56\times$. As for model accuracy, GraphTheta is capable of learning as good GNNs as existing frameworks. To the best of our knowledge, this work presents the largest edge-attributed GNN learning task in the literature.
    Chebyshev-Cantelli PAC-Bayes-Bennett Inequality for the Weighted Majority Vote. (arXiv:2106.13624v2 [cs.LG] UPDATED)
    We present a new second-order oracle bound for the expected risk of a weighted majority vote. The bound is based on a novel parametric form of the Chebyshev- Cantelli inequality (a.k.a. one-sided Chebyshev's), which is amenable to efficient minimization. The new form resolves the optimization challenge faced by prior oracle bounds based on the Chebyshev-Cantelli inequality, the C-bounds [Germain et al., 2015], and, at the same time, it improves on the oracle bound based on second order Markov's inequality introduced by Masegosa et al. [2020]. We also derive a new concentration of measure inequality, which we name PAC-Bayes-Bennett, since it combines PAC-Bayesian bounding with Bennett's inequality. We use it for empirical estimation of the oracle bound. The PAC-Bayes-Bennett inequality improves on the PAC-Bayes-Bernstein inequality of Seldin et al. [2012]. We provide an empirical evaluation demonstrating that the new bounds can improve on the work of Masegosa et al. [2020]. Both the parametric form of the Chebyshev-Cantelli inequality and the PAC-Bayes-Bennett inequality may be of independent interest for the study of concentration of measure in other domains.
    The Prominence of Artificial Intelligence in COVID-19. (arXiv:2111.09537v2 [cs.LG] UPDATED)
    In December 2019, a novel virus called COVID-19 had caused an enormous number of causalities to date. The battle with the novel Coronavirus is baffling and horrifying after the Spanish Flu 2019. While the front-line doctors and medical researchers have made significant progress in controlling the spread of the highly contiguous virus, technology has also proved its significance in the battle. Moreover, Artificial Intelligence has been adopted in many medical applications to diagnose many diseases, even baffling experienced doctors. Therefore, this survey paper explores the methodologies proposed that can aid doctors and researchers in early and inexpensive methods of diagnosis of the disease. Most developing countries have difficulties carrying out tests using the conventional manner, but a significant way can be adopted with Machine and Deep Learning. On the other hand, the access to different types of medical images has motivated the researchers. As a result, a mammoth number of techniques are proposed. This paper first details the background knowledge of the conventional methods in the Artificial Intelligence domain. Following that, we gather the commonly used datasets and their use cases to date. In addition, we also show the percentage of researchers adopting Machine Learning over Deep Learning. Thus we provide a thorough analysis of this scenario. Lastly, in the research challenges, we elaborate on the problems faced in COVID-19 research, and we address the issues with our understanding to build a bright and healthy environment.
    SITTA: Single Image Texture Translation for Data Augmentation. (arXiv:2106.13804v2 [cs.CV] UPDATED)
    Recent advances in data augmentation enable one to translate images by learning the mapping between a source domain and a target domain. Existing methods tend to learn the distributions by training a model on a variety of datasets, with results evaluated largely in a subjective manner. Relatively few works in this area, however, study the potential use of image synthesis methods for recognition tasks. In this paper, we propose and explore the problem of image translation for data augmentation. We first propose a lightweight yet efficient model for translating texture to augment images based on a single input of source texture, allowing for fast training and testing, referred to as Single Image Texture Translation for data Augmentation (SITTA). Then we explore the use of augmented data in long-tailed and few-shot image classification tasks. We find the proposed augmentation method and workflow is capable of translating the texture of input data into a target domain, leading to consistently improved image recognition performance. Finally, we examine how SITTA and related image translation methods can provide a basis for a data-efficient, "augmentation engineering" approach to model training. Codes are available at https://github.com/Boyiliee/SITTA.
    Pruning Edges and Gradients to Learn Hypergraphs from Larger Sets. (arXiv:2106.13919v2 [cs.LG] UPDATED)
    This paper aims for set-to-hypergraph prediction, where the goal is to infer the set of relations for a given set of entities. This is a common abstraction for applications in particle physics, biological systems, and combinatorial optimization. We address two common scaling problems encountered in set-to-hypergraph tasks that limit the size of the input set: the exponentially growing number of hyperedges and the run-time complexity, both leading to higher memory requirements. We make three contributions. First, we propose to predict and supervise the \emph{positive} edges only, which changes the asymptotic memory scaling from exponential to linear. Second, we introduce a training method that encourages iterative refinement of the predicted hypergraph, which allows us to skip iterations in the backward pass for improved efficiency and constant memory usage. Third, we combine both contributions in a single set-to-hypergraph model that enables us to address problems with larger input set sizes. We provide ablations for our main technical contributions and show that our model outperforms prior state-of-the-art, especially for larger sets.
    Genetic algorithm for feature selection of EEG heterogeneous data. (arXiv:2103.07117v2 [cs.NE] UPDATED)
    The electroencephalographic (EEG) signals provide highly informative data on brain activities and functions. However, their heterogeneity and high dimensionality may represent an obstacle for their interpretation. The introduction of a priori knowledge seems the best option to mitigate high dimensionality problems, but could lose some information and patterns present in the data, while data heterogeneity remains an open issue that often makes generalization difficult. In this study, we propose a genetic algorithm (GA) for feature selection that can be used with a supervised or unsupervised approach. Our proposal considers three different fitness functions without relying on expert knowledge. Starting from two publicly available datasets on cognitive workload and motor movement/imagery, the EEG signals are processed, normalized and their features computed in the time, frequency and time-frequency domains. The feature vector selection is performed by applying our GA proposal and compared with two benchmarking techniques. The results show that different combinations of our proposal achieve better results in respect to the benchmark in terms of overall performance and feature reduction. Moreover, the proposed GA, based on a novel fitness function here presented, outperforms the benchmark when the two different datasets considered are merged together, showing the effectiveness of our proposal on heterogeneous data.
    Relay Variational Inference: A Method for Accelerated Encoderless VI. (arXiv:2110.13422v2 [cs.LG] UPDATED)
    Variational Inference (VI) offers a method for approximating intractable likelihoods. In neural VI, inference of approximate posteriors is commonly done using an encoder. Alternatively, encoderless VI offers a framework for learning generative models from data without encountering suboptimalities caused by amortization via an encoder (e.g. in presence of missing or uncertain data). However, in absence of an encoder, such methods often suffer in convergence due to the slow nature of gradient steps required to learn the approximate posterior parameters. In this paper, we introduce Relay VI (RVI), a framework that dramatically improves both the convergence and performance of encoderless VI. In our experiments over multiple datasets, we study the effectiveness of RVI in terms of convergence speed, loss, representation power and missing data imputation. We find RVI to be a unique tool, often superior in both performance and convergence speed to previously proposed encoderless as well as amortized VI models (e.g. VAE).
    Labels, Information, and Computation: Efficient Learning Using Sufficient Labels. (arXiv:2104.09015v3 [cs.LG] UPDATED)
    In supervised learning, obtaining a large set of fully-labeled training data is expensive. We show that we do not always need full label information on every single training example to train a competent classifier. Specifically, inspired by the principle of sufficiency in statistics, we present a statistic (a summary) of the fully-labeled training set that captures almost all the relevant information for classification but at the same time is easier to obtain directly. We call this statistic "sufficiently-labeled data" and prove its sufficiency and efficiency for finding the optimal hidden representations, on which competent classifier heads can be trained using as few as a single randomly-chosen fully-labeled example per class. Sufficiently-labeled data can be obtained from annotators directly without collecting the fully-labeled data first. And we prove that it is easier to directly obtain sufficiently-labeled data than obtaining fully-labeled data. Furthermore, sufficiently-labeled data is naturally more secure since it stores relative, instead of absolute, information. Extensive experimental results are provided to support our theory.
    Random Planted Forest: a directly interpretable tree ensemble. (arXiv:2012.14563v2 [stat.ML] UPDATED)
    We introduce a novel interpretable, tree based algorithm for prediction in a regression setting in which each tree in a classical random forest is replaced by a family of planted trees that grow simultaneously. The motivation for our algorithm is to estimate the unknown regression function from a functional decomposition perspective, where each tree corresponds to a function within that decomposition. The maximal order of approximation in the decomposition can be specified or left unlimited. If a first order approximation is chosen, the result is an additive model. In the other extreme case, if the order of approximation is not limited, the resulting model places no restrictions on the form of the regression function. In a simulation study we find encouraging prediction and visualisation properties of our random planted forest method. We also develop theory for an idealised version of random planted forests in cases where the maximal order of approximation is low. We show that if the order is smaller than three, the idealised version achieves asymptotically optimal convergence rates up to a logarithmic factor. ode is available on https://github.com/PlantedML/randomPlantedForest
    Post-training Quantization for Neural Networks with Provable Guarantees. (arXiv:2201.11113v3 [cs.LG] UPDATED)
    While neural networks have been remarkably successful in a wide array of applications, implementing them in resource-constrained hardware remains an area of intense research. By replacing the weights of a neural network with quantized (e.g., 4-bit, or binary) counterparts, massive savings in computation cost, memory, and power consumption are attained. To that end, we generalize a post-training neural-network quantization method, GPFQ, that is based on a greedy path-following mechanism. Among other things, we propose modifications to promote sparsity of the weights, and rigorously analyze the associated error. Additionally, our error analysis expands the results of previous work on GPFQ to handle general quantization alphabets, showing that for quantizing a single-layer network, the relative square error essentially decays linearly in the number of weights -- i.e., level of over-parametrization. Our result holds across a range of input distributions and for both fully-connected and convolutional architectures thereby also extending previous results. To empirically evaluate the method, we quantize several common architectures with few bits per weight, and test them on ImageNet, showing only minor loss of accuracy compared to unquantized models. We also demonstrate that standard modifications, such as bias correction and mixed precision quantization, further improve accuracy.
    When saliency goes off on a tangent: Interpreting Deep Neural Networks with nonlinear saliency maps. (arXiv:2110.06639v3 [cs.LG] UPDATED)
    A fundamental bottleneck in utilising complex machine learning systems for critical applications has been not knowing why they do and what they do, thus preventing the development of any crucial safety protocols. To date, no method exist that can provide full insight into the granularity of the neural network's decision process. In the past, saliency maps were an early attempt at resolving this problem through sensitivity calculations, whereby dimensions of a data point are selected based on how sensitive the output of the system is to them. However, the success of saliency maps has been at best limited, mainly due to the fact that they interpret the underlying learning system through a linear approximation. We present a novel class of methods for generating nonlinear saliency maps which fully account for the nonlinearity of the underlying learning system. While agreeing with linear saliency maps on simple problems where linear saliency maps are correct, they clearly identify more specific drivers of classification on complex examples where nonlinearities are more pronounced. This new class of methods significantly aids interpretability of deep neural networks and related machine learning systems. Crucially, they provide a starting point for their more broad use in serious applications, where 'why' is equally important as 'what'.
    On the Sample Complexity of Stability Constrained Imitation Learning. (arXiv:2102.09161v3 [cs.LG] UPDATED)
    We study the following question in the context of imitation learning for continuous control: how are the underlying stability properties of an expert policy reflected in the sample-complexity of an imitation learning task? We provide the first results showing that a surprisingly granular connection can be made between the underlying expert system's incremental gain stability, a novel measure of robust convergence between pairs of system trajectories, and the dependency on the task horizon $T$ of the resulting generalization bounds. In particular, we propose and analyze incremental gain stability constrained versions of behavior cloning and a DAgger-like algorithm, and show that the resulting sample-complexity bounds naturally reflect the underlying stability properties of the expert system. As a special case, we delineate a class of systems for which the number of trajectories needed to achieve $\varepsilon$-suboptimality is sublinear in the task horizon $T$, and do so without requiring (strong) convexity of the loss function in the policy parameters. Finally, we conduct numerical experiments demonstrating the validity of our insights on both a simple nonlinear system for which the underlying stability properties can be easily tuned, and on a high-dimensional quadrupedal robotic simulation.
    Temporal-Logic-Based Reward Shaping for Continuing Reinforcement Learning Tasks. (arXiv:2007.01498v2 [cs.AI] UPDATED)
    In continuing tasks, average-reward reinforcement learning may be a more appropriate problem formulation than the more common discounted reward formulation. As usual, learning an optimal policy in this setting typically requires a large amount of training experiences. Reward shaping is a common approach for incorporating domain knowledge into reinforcement learning in order to speed up convergence to an optimal policy. However, to the best of our knowledge, the theoretical properties of reward shaping have thus far only been established in the discounted setting. This paper presents the first reward shaping framework for average-reward learning and proves that, under standard assumptions, the optimal policy under the original reward function can be recovered. In order to avoid the need for manual construction of the shaping function, we introduce a method for utilizing domain knowledge expressed as a temporal logic formula. The formula is automatically translated to a shaping function that provides additional reward throughout the learning process. We evaluate the proposed method on three continuing tasks. In all cases, shaping speeds up the average-reward learning rate without any reduction in the performance of the learned policy compared to relevant baselines.
    Sustainable Recreational Fishing Using a Novel Electrical Muscle Stimulation (EMS) Lure and Ensemble Network Algorithm to Maximize Catch and Release Survivability. (arXiv:2006.10125v2 [cs.CV] UPDATED)
    With 200-700 million anglers in the world, sportfishing is nearly five times more common than commercial trawling. Worldwide, hundreds of thousands of jobs are linked to the sportfishing industry, which generates billions of dollars for water-side communities and fisheries conservatories alike. However, the sheer popularity of recreational fishing poses threats to aquatic biodiversity that are hard to regulate. For example, as much as 25% of overfished populations can be traced to anglers. This alarming statistic is explained by the average catch and release mortality rate of 43%, which primarily results from hook-related injuries and careless out-of-water handling. The provisional-patented design proposed in this paper addresses both these problems separately First, a novel, electrical muscle stimulation based fishing lure is proposed as a harmless and low cost alternative to sharp hooks. Early prototypes show a constant electrical current of 90 mA applied through a 200g European perch's jaw can support a reeling tension of 2N - safely within the necessary ranges. Second, a fish-eye camera bob is designed to wirelessly relay underwater footage to a smartphone app, where an ensemble convolutional neural network automatically classifies the fish's species, estimates its length, and cross references with local and state fishing regulations (ie. minimum size, maximum bag limit, and catch season). This capability reduces overfishing by helping anglers avoid accidentally violating guidelines and eliminates the need to reel the fish in and expose it to negligent handling. IN conjunction, this cheap, lightweight, yet high-tech invention is a paradigm shift in preserving a world favorite pastime; while at the same time making recreational fishing more sustainable.
    Universal Prediction Band via Semi-Definite Programming. (arXiv:2103.17203v3 [stat.ML] UPDATED)
    We propose a computationally efficient method to construct nonparametric, heteroscedastic prediction bands for uncertainty quantification, with or without any user-specified predictive model. Our approach provides an alternative to the now-standard conformal prediction for uncertainty quantification, with novel theoretical insights and computational advantages. The data-adaptive prediction band is universally applicable with minimal distributional assumptions, has strong non-asymptotic coverage properties, and is easy to implement using standard convex programs. Our approach can be viewed as a novel variance interpolation with confidence and further leverages techniques from semi-definite programming and sum-of-squares optimization. Theoretical and numerical performances for the proposed approach for uncertainty quantification are analyzed.
    Decentralized Exploration in Multi-Armed Bandits -- Extended version. (arXiv:1811.07763v6 [cs.LG] UPDATED)
    We consider the decentralized exploration problem: a set of players collaborate to identify the best arm by asynchronously interacting with the same stochastic environment. The objective is to insure privacy in the best arm identification problem between asynchronous, collaborative, and thrifty players. In the context of a digital service, we advocate that this decentralized approach allows a good balance between the interests of users and those of service providers: the providers optimize their services, while protecting the privacy of the users and saving resources. We define the privacy level as the amount of information an adversary could infer by intercepting the messages concerning a single user. We provide a generic algorithm Decentralized Elimination, which uses any best arm identification algorithm as a subroutine. We prove that this algorithm insures privacy, with a low communication cost, and that in comparison to the lower bound of the best arm identification problem, its sample complexity suffers from a penalty depending on the inverse of the probability of the most frequent players. Then, thanks to the genericity of the approach, we extend the proposed algorithm to the non-stationary bandits. Finally, experiments illustrate and complete the analysis.
    Necessary and Sufficient Conditions for Inverse Reinforcement Learning of Bayesian Stopping Time Problems. (arXiv:2007.03481v5 [cs.LG] UPDATED)
    This paper presents an inverse reinforcement learning~(IRL) framework for Bayesian stopping time problems. By observing the actions of a Bayesian decision maker, we provide a necessary and sufficient condition to identify if these actions are consistent with optimizing a cost function. In a Bayesian (partially observed) setting, the inverse learner can at best identify optimality wrt the observed actions. Our IRL algorithm identifies optimality and then constructs set valued estimates of the cost function. To achieve this IRL objective, we use novel ideas from Bayesian revealed preferences stemming from microeconomics. We illustrate the proposed IRL scheme using two important examples of stopping time problems, namely, sequential hypothesis testing and Bayesian search, and also on a real-world YouTube dataset. Finally, for finite datasets, we propose an IRL detection algorithm and give finite sample bounds on its error probabilities.
    Approximation Theory of Tree Tensor Networks: Tensorized Univariate Functions -- Part I. (arXiv:2007.00118v4 [math.FA] UPDATED)
    We study the approximation of functions by tensor networks (TNs). We show that Lebesgue $L^p$-spaces in one dimension can be identified with tensor product spaces of arbitrary order through tensorization. We use this tensor product structure to define subsets of $L^p$ of rank-structured functions of finite representation complexity. These subsets are then used to define different approximation classes of tensor networks, associated with different measures of complexity. These approximation classes are shown to be quasi-normed linear spaces. We study some elementary properties and relationships of said spaces. In part II of this work, we will show that classical smoothness (Besov) spaces are continuously embedded into these approximation classes. We will also show that functions in these approximation classes do not possess any Besov smoothness, unless one restricts the depth of the tensor networks. The results of this work are both an analysis of the approximation spaces of TNs and a study of the expressivity of a particular type of neural networks (NN) -- namely feed-forward sum-product networks with sparse architecture. The input variables of this network result from the tensorization step, interpreted as a particular featuring step which can also be implemented with a neural network with a specific architecture. We point out interesting parallels to recent results on the expressivity of rectified linear unit (ReLU) networks -- currently one of the most popular type of NNs.
    Approximation Theory of Tree Tensor Networks: Tensorized Univariate Functions -- Part II. (arXiv:2007.00128v4 [math.FA] UPDATED)
    We study the approximation by tensor networks (TNs) of functions from classical smoothness classes. The considered approximation tool combines a tensorization of functions in $L^p([0,1))$, which allows to identify a univariate function with a multivariate function (or tensor), and the use of tree tensor networks (the tensor train format) for exploiting low-rank structures of multivariate functions. The resulting tool can be interpreted as a feed-forward neural network, with first layers implementing the tensorization, interpreted as a particular featuring step, followed by a sum-product network with sparse architecture. In part I of this work, we presented several approximation classes associated with different measures of complexity of tensor networks and studied their properties. In this work (part II), we show how classical approximation tools, such as polynomials or splines (with fixed or free knots), can be encoded as a tensor network with controlled complexity. We use this to derive direct (Jackson) inequalities for the approximation spaces of tensor networks. This is then utilized to show that Besov spaces are continuously embedded into these approximation spaces. In other words, we show that arbitrary Besov functions can be approximated with optimal or near to optimal rate. We also show that an arbitrary function in the approximation class possesses no Besov smoothness, unless one limits the depth of the tensor network.
    Nonlinear Independent Component Analysis for Discrete-Time and Continuous-Time Signals. (arXiv:2102.02876v3 [stat.ML] UPDATED)
    We study the classical problem of recovering a multidimensional source signal from observations of nonlinear mixtures of this signal. We show that this recovery is possible (up to a permutation and monotone scaling of the source's original component signals) if the mixture is due to a sufficiently differentiable and invertible but otherwise arbitrarily nonlinear function and the component signals of the source are statistically independent with 'non-degenerate' second-order statistics. The latter assumption requires the source signal to meet one of three regularity conditions which essentially ensure that the source is sufficiently far away from the non-recoverable extremes of being deterministic or constant in time. These assumptions, which cover many popular time series models and stochastic processes, allow us to reformulate the initial problem of nonlinear blind source separation as a simple-to-state problem of optimisation-based function approximation. We propose to solve this approximation problem by minimizing a novel type of objective function that efficiently quantifies the mutual statistical dependence between multiple stochastic processes via cumulant-like statistics. This yields a scalable and direct new method for nonlinear Independent Component Analysis with widely applicable theoretical guarantees and for which our experiments indicate good performance.
    Robust Max Entrywise Error Bounds for Tensor Estimation from Sparse Observations via Similarity Based Collaborative Filtering. (arXiv:1908.01241v4 [cs.LG] UPDATED)
    Consider the task of estimating a 3-order $n \times n \times n$ tensor from noisy observations of randomly chosen entries in the sparse regime. We introduce a similarity based collaborative filtering algorithm for estimating a tensor from sparse observations and argue that it achieves sample complexity that nearly matches the conjectured computationally efficient lower bound on the sample complexity for the setting of low-rank tensors. Our algorithm uses the matrix obtained from the flattened tensor to compute similarity, and estimates the tensor entries using a nearest neighbor estimator. We prove that the algorithm recovers a finite rank tensor with maximum entry-wise error (MEE) and mean-squared-error (MSE) decaying to $0$ as long as each entry is observed independently with probability $p = \Omega(n^{-3/2 + \kappa})$ for any arbitrarily small $\kappa > 0$. More generally, we establish robustness of the estimator, showing that when arbitrary noise bounded by $\varepsilon \geq 0$ is added to each observation, the estimation error with respect to MEE and MSE degrades by $\text{poly}(\varepsilon)$. Consequently, even if the tensor may not have finite rank but can be approximated within $\varepsilon \geq 0$ by a finite rank tensor, then the estimation error converges to $\text{poly}(\varepsilon)$. Our analysis sheds insight into the conjectured sample complexity lower bound, showing that it matches the connectivity threshold of the graph used by our algorithm for estimating similarity between coordinates.
    Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning. (arXiv:1909.05850v6 [stat.ML] UPDATED)
    Off-policy evaluation (OPE) in reinforcement learning is notoriously difficult in long- and infinite-horizon settings due to diminishing overlap between behavior and target policies. In this paper, we study the role of Markovian and time-invariant structure in efficient OPE. We first derive the efficiency bounds for OPE when one assumes each of these structures. This precisely characterizes the curse of horizon: in time-variant processes, OPE is only feasible in the near-on-policy setting, where behavior and target policies are sufficiently similar. But, in time-invariant Markov decision processes, our bounds show that truly-off-policy evaluation is feasible, even with only just one dependent trajectory, and provide the limits of how well we could hope to do. We develop a new estimator based on Double Reinforcement Learning (DRL) that leverages this structure for OPE using the efficient influence function we derive. Our DRL estimator simultaneously uses estimated stationary density ratios and $q$-functions and remains efficient when both are estimated at slow, nonparametric rates and remains consistent when either is estimated consistently. We investigate these properties and the performance benefits of leveraging the problem structure for more efficient OPE.
    Q-EEGNet: an Energy-Efficient 8-bit Quantized Parallel EEGNet Implementation for Edge Motor-Imagery Brain--Machine Interfaces. (arXiv:2004.11690v3 [eess.SP] UPDATED)
    Motor-Imagery Brain--Machine Interfaces (MI-BMIs)promise direct and accessible communication between human brains and machines by analyzing brain activities recorded with Electroencephalography (EEG). Latency, reliability, and privacy constraints make it unsuitable to offload the computation to the cloud. Practical use cases demand a wearable, battery-operated device with low average power consumption for long-term use. Recently, sophisticated algorithms, in particular deep learning models, have emerged for classifying EEG signals. While reaching outstanding accuracy, these models often exceed the limitations of edge devices due to their memory and computational requirements. In this paper, we demonstrate algorithmic and implementation optimizations for EEGNET, a compact Convolutional Neural Network (CNN) suitable for many BMI paradigms. We quantize weights and activations to 8-bit fixed-point with a negligible accuracy loss of 0.4% on 4-class MI, and present an energy-efficient hardware-aware implementation on the Mr.Wolf parallel ultra-low power (PULP) System-on-Chip (SoC) by utilizing its custom RISC-V ISA extensions and 8-core compute cluster. With our proposed optimization steps, we can obtain an overall speedup of 64x and a reduction of up to 85% in memory footprint with respect to a single-core layer-wise baseline implementation. Our implementation takes only 5.82 ms and consumes 0.627 mJ per inference. With 21.0GMAC/s/W, it is 256x more energy-efficient than an EEGNET implementation on an ARM Cortex-M7 (0.082GMAC/s/W).
    Approximation Theory of Tree Tensor Networks: Tensorized Multivariate Functions. (arXiv:2101.11932v2 [math.FA] UPDATED)
    We study the approximation of multivariate functions with tensor networks (TNs). The main conclusion of this work is an answer to the following two questions: "What are the approximation capabilities of TNs?" and "What is an appropriate model class of functions that can be approximated with TNs?" To answer the former: we show that TNs can (near to) optimally replicate $h$-uniform and $h$-adaptive approximation, for any smoothness order of the target function. Tensor networks thus exhibit universal expressivity w.r.t. isotropic, anisotropic and mixed smoothness spaces that is comparable with more general neural networks families such as deep rectified linear unit (ReLU) networks. Put differently, TNs have the capacity to (near to) optimally approximate many function classes -- without being adapted to the particular class in question. To answer the latter: as a candidate model class we consider approximation classes of TNs and show that these are (quasi-)Banach spaces, that many types of classical smoothness spaces are continuously embedded into said approximation classes and that TN approximation classes are themselves not embedded in any classical smoothness space.
    Taming neural networks with TUSLA: Non-convex learning via adaptive stochastic gradient Langevin algorithms. (arXiv:2006.14514v4 [cs.LG] UPDATED)
    Artificial neural networks (ANNs) are typically highly nonlinear systems which are finely tuned via the optimization of their associated, non-convex loss functions. In many cases, the gradient of any such loss function has superlinear growth, making the use of the widely-accepted (stochastic) gradient descent methods, which are based on Euler numerical schemes, problematic. We offer a new learning algorithm based on an appropriately constructed variant of the popular stochastic gradient Langevin dynamics (SGLD), which is called tamed unadjusted stochastic Langevin algorithm (TUSLA). We also provide a nonasymptotic analysis of the new algorithm's convergence properties in the context of non-convex learning problems with the use of ANNs. Thus, we provide finite-time guarantees for TUSLA to find approximate minimizers of both empirical and population risks. The roots of the TUSLA algorithm are based on the taming technology for diffusion processes with superlinear coefficients as developed in \citet{tamed-euler, SabanisAoAP} and for MCMC algorithms in \citet{tula}. Numerical experiments are presented which confirm the theoretical findings and illustrate the need for the use of the new algorithm in comparison to vanilla SGLD within the framework of ANNs.
    Learned Lifted Linearization Applied to Unstable Dynamic Systems Enabled by Koopman Direct Encoding. (arXiv:2210.13602v2 [cs.LG] UPDATED)
    This paper presents a Koopman lifting linearization method that is applicable to nonlinear dynamical systems having both stable and unstable regions. It is known that DMD and other standard data-driven methods face a fundamental difficulty in constructing a Koopman model when applied to unstable systems. Here we solve the problem by incorporating knowledge about a nonlinear state equation with a learning method for finding an effective set of observables. In a lifted space, stable and unstable regions are separated into independent subspaces. Based on this property, we propose to find effective observables through neural net training where training data are separated into stable and unstable trajectories. The resultant learned observables are used for constructing a linear state transition matrix using method known as Direct Encoding, which transforms the nonlinear state equation to a state transition matrix through inner product computations with the observables. The proposed method shows a dramatic improvement over existing DMD and data-driven methods.
    Identifying Time Lag in Dynamical Systems with Copula Entropy based Transfer Entropy. (arXiv:2301.06037v1 [cs.LG])
    Time lag between variables is a key characteristics of dynamical systems in different fields and identifying such time lag is a central problem in complex systems with many applications. Transfer Entropy (TE) was proposed as a tool for time lag identification recently. Unfortunately, estimating TE has been a notoriously difficult problem. Copula Entropy (CE) is a measure of statistical independence and it was proved that TE can be represented with only CE. Therefore, a non-parametric estimator of TE based on CE was proposed according to such representation recently. In this paper we propose to use the CE-based estimator of TE to identify time lag in dynamical systems. Both simulated and real data are used to verify the effectiveness of the proposed method in the experiments. Experimental results show that the proposed method can identify the time lags in the three simulated systems. The real data experiment with the data on power consumption of the Tetouan city also demonstrates that our method can identify the pattern of time lags through the estimated TE from the weather factors to the power consumption of the city.
    Pluto's Surface Mapping using Unsupervised Learning from Near-Infrared Observations of LEISA/Ralph. (arXiv:2301.06027v1 [astro-ph.EP])
    We map the surface of Pluto using an unsupervised machine learning technique using the near-infrared observations of the LEISA/Ralph instrument onboard NASA's New Horizons spacecraft. The principal component reduced Gaussian mixture model was implemented to investigate the geographic distribution of the surface units across the dwarf planet. We also present the likelihood of each surface unit at the image pixel level. Average I/F spectra of each unit were analyzed -- in terms of the position and strengths of absorption bands of abundant volatiles such as N${}_{2}$, CH${}_{4}$, and CO and nonvolatile H${}_{2}$O -- to connect the unit to surface composition, geology, and geographic location. The distribution of surface units shows a latitudinal pattern with distinct surface compositions of volatiles -- consistent with the existing literature. However, previous mapping efforts were based primarily on compositional analysis using spectral indices (indicators) or implementation of complex radiative transfer models, which need (prior) expert knowledge, label data, or optical constants of representative endmembers. We prove that an application of unsupervised learning in this instance renders a satisfactory result in mapping the spatial distribution of ice compositions without any prior information or label data. Thus, such an application is specifically advantageous for a planetary surface mapping when label data are poorly constrained or completely unknown, because an understanding of surface material distribution is vital for volatile transport modeling at the planetary scale. We emphasize that the unsupervised learning used in this study has wide applicability and can be expanded to other planetary bodies of the Solar System for mapping surface material distribution.
    Self-recovery of memory via generative replay. (arXiv:2301.06030v1 [cs.NE])
    A remarkable capacity of the brain is its ability to autonomously reorganize memories during offline periods. Memory replay, a mechanism hypothesized to underlie biological offline learning, has inspired offline methods for reducing forgetting in artificial neural networks in continual learning settings. A memory-efficient and neurally-plausible method is generative replay, which achieves state of the art performance on continual learning benchmarks. However, unlike the brain, standard generative replay does not self-reorganize memories when trained offline on its own replay samples. We propose a novel architecture that augments generative replay with an adaptive, brain-like capacity to autonomously recover memories. We demonstrate this capacity of the architecture across several continual learning tasks and environments.
    Semantic and Effective Communication for Remote Control Tasks with Dynamic Feature Compression. (arXiv:2301.05901v1 [cs.LG])
    The coordination of robotic swarms and the remote wireless control of industrial systems are among the major use cases for 5G and beyond systems: in these cases, the massive amounts of sensory information that needs to be shared over the wireless medium can overload even high-capacity connections. Consequently, solving the effective communication problem by optimizing the transmission strategy to discard irrelevant information can provide a significant advantage, but is often a very complex task. In this work, we consider a prototypal system in which an observer must communicate its sensory data to an actor controlling a task (e.g., a mobile robot in a factory). We then model it as a remote Partially Observable Markov Decision Process (POMDP), considering the effect of adopting semantic and effective communication-oriented solutions on the overall system performance. We split the communication problem by considering an ensemble Vector Quantized Variational Autoencoder (VQ-VAE) encoding, and train a Deep Reinforcement Learning (DRL) agent to dynamically adapt the quantization level, considering both the current state of the environment and the memory of past messages. We tested the proposed approach on the well-known CartPole reference control problem, obtaining a significant performance increase over traditional approaches
    An Accurate EEGNet-based Motor-Imagery Brain-Computer Interface for Low-Power Edge Computing. (arXiv:2004.00077v3 [eess.SP] UPDATED)
    This paper presents an accurate and robust embedded motor-imagery brain-computer interface (MI-BCI). The proposed novel model, based on EEGNet, matches the requirements of memory footprint and computational resources of low-power microcontroller units (MCUs), such as the ARM Cortex-M family. Furthermore, the paper presents a set of methods, including temporal downsampling, channel selection, and narrowing of the classification window, to further scale down the model to relax memory requirements with negligible accuracy degradation. Experimental results on the Physionet EEG Motor Movement/Imagery Dataset show that standard EEGNet achieves 82.43%, 75.07%, and 65.07% classification accuracy on 2-, 3-, and 4-class MI tasks in global validation, outperforming the state-of-the-art (SoA) convolutional neural network (CNN) by 2.05%, 5.25%, and 5.48%. Our novel method further scales down the standard EEGNet at a negligible accuracy loss of 0.31% with 7.6x memory footprint reduction and a small accuracy loss of 2.51% with 15x reduction. The scaled models are deployed on a commercial Cortex-M4F MCU taking 101ms and consuming 4.28mJ per inference for operating the smallest model, and on a Cortex-M7 with 44ms and 18.1mJ per inference for the medium-sized model, enabling a fully autonomous, wearable, and accurate low-power BCI.
    Micro and Macro Level Graph Modeling for Graph Variational Auto-Encoders. (arXiv:2210.16844v2 [cs.LG] UPDATED)
    Generative models for graph data are an important research topic in machine learning. Graph data comprise two levels that are typically analyzed separately: node-level properties such as the existence of a link between a pair of nodes, and global aggregate graph-level statistics, such as motif counts. This paper proposes a new multi-level framework that jointly models node-level properties and graph-level statistics, as mutually reinforcing sources of information. We introduce a new micro-macro training objective for graph generation that combines node-level and graph-level losses. We utilize the micro-macro objective to improve graph generation with a GraphVAE, a well-established model based on graph-level latent variables, that provides fast training and generation time for medium-sized graphs. Our experiments show that adding micro-macro modeling to the GraphVAE model improves graph quality scores up to 2 orders of magnitude on five benchmark datasets, while maintaining the GraphVAE generation speed advantage.
    EvoAAA: An evolutionary methodology for automated \neural autoencoder architecture search. (arXiv:2301.06047v1 [cs.NE])
    Machine learning models work better when curated features are provided to them. Feature engineering methods have been usually used as a preprocessing step to obtain or build a proper feature set. In late years, autoencoders (a specific type of symmetrical neural network) have been widely used to perform representation learning, proving their competitiveness against classical feature engineering algorithms. The main obstacle in the use of autoencoders is finding a good architecture, a process that most experts confront manually. An automated autoencoder architecture search procedure, based on evolutionary methods, is proposed in this paper. The methodology is tested against nine heterogeneous data sets. The obtained results show the ability of this approach to find better architectures, able to concentrate most of the useful information in a minimized coding, in a reduced time.
    Margin Optimal Classification Trees. (arXiv:2210.10567v4 [math.OC] UPDATED)
    In recent years there has been growing attention to interpretable machine learning models which can give explanatory insights on their behavior. Thanks to their interpretability, decision trees have been intensively studied for classification tasks, and due to the remarkable advances in mixed-integer programming (MIP), various approaches have been proposed to formulate the problem of training an Optimal Classification Tree (OCT) as a MIP model. We present a novel mixed-integer quadratic formulation for the OCT problem, which exploits the generalization capabilities of Support Vector Machines for binary classification. Our model, denoted as Margin Optimal Classification Tree (MARGOT), encompasses the use of maximum margin multivariate hyperplanes nested in a binary tree structure. To enhance the interpretability of our approach, we analyse two alternative versions of MARGOT, which include feature selection constraints inducing local sparsity of the hyperplanes. First, MARGOT has been tested on non-linearly separable synthetic datasets in 2-dimensional feature space to provide a graphical representation of the maximum margin approach. Finally, the proposed models have been tested on benchmark datasets from the UCI repository. The MARGOT formulation turns out to be easier to solve than other OCT approaches, and the generated tree better generalizes on new observations. The two interpretable versions are effective in selecting the most relevant features and maintaining good prediction quality.
    A Review on the effectiveness of Dimensional Reduction with Computational Forensics: An Application on Malware Analysis. (arXiv:2301.06031v1 [cs.CR])
    The Android operating system is pervasively adopted as the operating system platform of choice for smart devices. However, the strong adoption has also resulted in exponential growth in the number of Android based malicious software or malware. To deal with such cyber threats as part of cyber investigation and digital forensics, computational techniques in the form of machine learning algorithms are applied for such malware identification, detection and forensics analysis. However, such Computational Forensics modelling techniques are constrained the volume, velocity, variety and veracity of the malware landscape. This in turn would affect its identification and detection effectiveness. Such consequence would inherently induce the question of sustainability with such solution approach. One approach to optimise effectiveness is to apply dimensional reduction techniques like Principal Component Analysis with the intent to enhance algorithmic performance. In this paper, we evaluate the effectiveness of the application of Principle Component Analysis on Computational Forensics task of detecting Android based malware. We applied our research hypothesis to three different datasets with different machine learning algorithms. Our research result showed that the dimensionally reduced dataset would result in a measure of degradation in accuracy performance.
    On the role of Model Uncertainties in Bayesian Optimization. (arXiv:2301.05983v1 [stat.ML])
    Bayesian optimization (BO) is a popular method for black-box optimization, which relies on uncertainty as part of its decision-making process when deciding which experiment to perform next. However, not much work has addressed the effect of uncertainty on the performance of the BO algorithm and to what extent calibrated uncertainties improve the ability to find the global optimum. In this work, we provide an extensive study of the relationship between the BO performance (regret) and uncertainty calibration for popular surrogate models and compare them across both synthetic and real-world experiments. Our results confirm that Gaussian Processes are strong surrogate models and that they tend to outperform other popular models. Our results further show a positive association between calibration error and regret, but interestingly, this association disappears when we control for the type of model in the analysis. We also studied the effect of re-calibration and demonstrate that it generally does not lead to improved regret. Finally, we provide theoretical justification for why uncertainty calibration might be difficult to combine with BO due to the small sample sizes commonly used.
    Sinkhorn Divergences for Unbalanced Optimal Transport. (arXiv:1910.12958v3 [math.OC] UPDATED)
    Optimal transport induces the Earth Mover's (Wasserstein) distance between probability distributions, a geometric divergence that is relevant to a wide range of problems. Over the last decade, two relaxations of optimal transport have been studied in depth: unbalanced transport, which is robust to the presence of outliers and can be used when distributions don't have the same total mass; entropy-regularized transport, which is robust to sampling noise and lends itself to fast computations using the Sinkhorn algorithm. This paper combines both lines of work to put robust optimal transport on solid ground. Our main contribution is a generalization of the Sinkhorn algorithm to unbalanced transport: our method alternates between the standard Sinkhorn updates and the pointwise application of a contractive function. This implies that entropic transport solvers on grid images, point clouds and sampled distributions can all be modified easily to support unbalanced transport, with a proof of linear convergence that holds in all settings. We then show how to use this method to define pseudo-distances on the full space of positive measures that satisfy key geometric axioms: (unbalanced) Sinkhorn divergences are differentiable, positive, definite, convex, statistically robust and avoid any "entropic bias" towards a shrinkage of the measures' supports.
    On Pseudo-Labeling for Class-Mismatch Semi-Supervised Learning. (arXiv:2301.06010v1 [cs.LG])
    When there are unlabeled Out-Of-Distribution (OOD) data from other classes, Semi-Supervised Learning (SSL) methods suffer from severe performance degradation and even get worse than merely training on labeled data. In this paper, we empirically analyze Pseudo-Labeling (PL) in class-mismatched SSL. PL is a simple and representative SSL method that transforms SSL problems into supervised learning by creating pseudo-labels for unlabeled data according to the model's prediction. We aim to answer two main questions: (1) How do OOD data influence PL? (2) What is the proper usage of OOD data with PL? First, we show that the major problem of PL is imbalanced pseudo-labels on OOD data. Second, we find that OOD data can help classify In-Distribution (ID) data given their OOD ground truth labels. Based on the findings, we propose to improve PL in class-mismatched SSL with two components -- Re-balanced Pseudo-Labeling (RPL) and Semantic Exploration Clustering (SEC). RPL re-balances pseudo-labels of high-confidence data, which simultaneously filters out OOD data and addresses the imbalance problem. SEC uses balanced clustering on low-confidence data to create pseudo-labels on extra classes, simulating the process of training with ground truth. Experiments show that our method achieves steady improvement over supervised baseline and state-of-the-art performance under all class mismatch ratios on different benchmarks.
    Interpretable and Scalable Graphical Models for Complex Spatio-temporal Processes. (arXiv:2301.06021v1 [cs.LG])
    This thesis focuses on data that has complex spatio-temporal structure and on probabilistic graphical models that learn the structure in an interpretable and scalable manner. We target two research areas of interest: Gaussian graphical models for tensor-variate data and summarization of complex time-varying texts using topic models. This work advances the state-of-the-art in several directions. First, it introduces a new class of tensor-variate Gaussian graphical models via the Sylvester tensor equation. Second, it develops an optimization technique based on a fast-converging proximal alternating linearized minimization method, which scales tensor-variate Gaussian graphical model estimations to modern big-data settings. Third, it connects Kronecker-structured (inverse) covariance models with spatio-temporal partial differential equations (PDEs) and introduces a new framework for ensemble Kalman filtering that is capable of tracking chaotic physical systems. Fourth, it proposes a modular and interpretable framework for unsupervised and weakly-supervised probabilistic topic modeling of time-varying data that combines generative statistical models with computational geometric methods. Throughout, practical applications of the methodology are considered using real datasets. This includes brain-connectivity analysis using EEG data, space weather forecasting using solar imaging data, longitudinal analysis of public opinions using Twitter data, and mining of mental health related issues using TalkLife data. We show in each case that the graphical modeling framework introduced here leads to improved interpretability, accuracy, and scalability.
    Data-Efficient Pipeline for Offline Reinforcement Learning with Limited Data. (arXiv:2210.08642v2 [cs.LG] UPDATED)
    Offline reinforcement learning (RL) can be used to improve future performance by leveraging historical data. There exist many different algorithms for offline RL, and it is well recognized that these algorithms, and their hyperparameter settings, can lead to decision policies with substantially differing performance. This prompts the need for pipelines that allow practitioners to systematically perform algorithm-hyperparameter selection for their setting. Critically, in most real-world settings, this pipeline must only involve the use of historical data. Inspired by statistical model selection methods for supervised learning, we introduce a task- and method-agnostic pipeline for automatically training, comparing, selecting, and deploying the best policy when the provided dataset is limited in size. In particular, our work highlights the importance of performing multiple data splits to produce more reliable algorithm-hyperparameter selection. While this is a common approach in supervised learning, to our knowledge, this has not been discussed in detail in the offline RL setting. We show it can have substantial impacts when the dataset is small. Compared to alternate approaches, our proposed pipeline outputs higher-performing deployed policies from a broad range of offline policy learning algorithms and across various simulation domains in healthcare, education, and robotics. This work contributes toward the development of a general-purpose meta-algorithm for automatic algorithm-hyperparameter selection for offline RL.
    Static, dynamic and stability analysis of multi-dimensional functional graded plate with variable thickness using deep neural network. (arXiv:2301.05900v1 [cs.LG])
    The goal of this paper is to analyze and predict the central deflection, natural frequency, and critical buckling load of the multi-directional functionally graded (FG) plate with variable thickness resting on an elastic Winkler foundation. First, the mathematical models of the static and eigenproblems are formulated in great detail. The FG material properties are assumed to vary smoothly and continuously throughout three directions of the plate according to a Mori-Tanaka micromechanics model distribution of volume fraction of constituents. Then, finite element analysis (FEA) with mixed interpolation of tensorial components of 4-nodes (MITC4) is implemented in order to eliminate theoretically a shear locking phenomenon existing. Next, influences of the variable thickness functions (uniform, non-uniform linear, and non-uniform non-linear), material properties, length-to-thickness ratio, boundary conditions, and elastic parameters on the plate response are investigated and discussed in detail through several numerical examples. Finally, a deep neural network (DNN) technique using batch normalization (BN) is learned to predict the non-dimensional values of multi-directional FG plates. The DNN model also shows that it is a powerful technique capable of handling an extensive database and different vital parameters in engineering applications.
    Deep-Reinforcement-Learning-based Path Planning for Industrial Robots using Distance Sensors as Observation. (arXiv:2301.05980v1 [cs.RO])
    Industrial robots are widely used in various manufacturing environments due to their efficiency in doing repetitive tasks such as assembly or welding. A common problem for these applications is to reach a destination without colliding with obstacles or other robot arms. Commonly used sampling-based path planning approaches such as RRT require long computation times, especially in complex environments. Furthermore, the environment in which they are employed needs to be known beforehand. When utilizing the approaches in new environments, a tedious engineering effort in setting hyperparameters needs to be conducted, which is time- and cost-intensive. On the other hand, Deep Reinforcement Learning has shown remarkable results in dealing with unknown environments, generalizing new problem instances, and solving motion planning problems efficiently. On that account, this paper proposes a Deep-Reinforcement-Learning-based motion planner for robotic manipulators. We evaluated our model against state-of-the-art sampling-based planners in several experiments. The results show the superiority of our planner in terms of path length and execution time.
    Recent advances in artificial intelligence for retrosynthesis. (arXiv:2301.05864v1 [cs.LG])
    Retrosynthesis is the cornerstone of organic chemistry, providing chemists in material and drug manufacturing access to poorly available and brand-new molecules. Conventional rule-based or expert-based computer-aided synthesis has obvious limitations, such as high labor costs and limited search space. In recent years, dramatic breakthroughs driven by artificial intelligence have revolutionized retrosynthesis. Here we aim to present a comprehensive review of recent advances in AI-based retrosynthesis. For single-step and multi-step retrosynthesis both, we first list their goal and provide a thorough taxonomy of existing methods. Afterwards, we analyze these methods in terms of their mechanism and performance, and introduce popular evaluation metrics for them, in which we also provide a detailed comparison among representative methods on several public datasets. In the next part we introduce popular databases and established platforms for retrosynthesis. Finally, this review concludes with a discussion about promising research directions in this field.
    Drug Synergistic Combinations Predictions via Large-Scale Pre-Training and Graph Structure Learning. (arXiv:2301.05931v1 [cs.LG])
    Drug combination therapy is a well-established strategy for disease treatment with better effectiveness and less safety degradation. However, identifying novel drug combinations through wet-lab experiments is resource intensive due to the vast combinatorial search space. Recently, computational approaches, specifically deep learning models have emerged as an efficient way to discover synergistic combinations. While previous methods reported fair performance, their models usually do not take advantage of multi-modal data and they are unable to handle new drugs or cell lines. In this study, we collected data from various datasets covering various drug-related aspects. Then, we take advantage of large-scale pre-training models to generate informative representations and features for drugs, proteins, and diseases. Based on that, a message-passing graph is built on top to propagate information together with graph structure learning flexibility. This is first introduced in the biological networks and enables us to generate pseudo-relations in the graph. Our framework achieves state-of-the-art results in comparison with other deep learning-based methods on synergistic prediction benchmark datasets. We are also capable of inferencing new drug combination data in a test on an independent set released by AstraZeneca, where 10% of improvement over previous methods is observed. In addition, we're robust against unseen drugs and surpass almost 15% AU ROC compared to the second-best model. We believe our framework contributes to both the future wet-lab discovery of novel drugs and the building of promising guidance for precise combination medicine.
    Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint). (arXiv:2301.05965v1 [cs.DB])
    Pioneering data profiling systems such as Metanome and OpenClean brought public attention to science-intensive data profiling. This type of profiling aims to extract complex patterns (primitives) such as functional dependencies, data constraints, association rules, and others. However, these tools are research prototypes rather than production-ready systems. The following work presents Desbordante - a high-performance science-intensive data profiler with open source code. Unlike similar systems, it is built with emphasis on industrial application in a multi-user environment. It is efficient, resilient to crashes, and scalable. Its efficiency is ensured by implementing discovery algorithms in C++, resilience is achieved by extensive use of containerization, and scalability is based on replication of containers. Desbordante aims to open industrial-grade primitive discovery to a broader public, focusing on domain experts who are not IT professionals. Aside from the discovery of various primitives, Desbordante offers primitive validation, which not only reports whether a given instance of primitive holds or not, but also points out what prevents it from holding via the use of special screens. Next, Desbordante supports pipelines - ready-to-use functionality implemented using the discovered primitives, for example, typo detection. We provide built-in pipelines, and the users can construct their own via provided Python bindings. Unlike other profilers, Desbordante works not only with tabular data, but with graph and transactional data as well. In this paper, we present Desbordante, the vision behind it and its use-cases. To provide a more in-depth perspective, we discuss its current state, architecture, and design decisions it is built on. Additionally, we outline our future plans.
    World Models and Predictive Coding for Cognitive and Developmental Robotics: Frontiers and Challenges. (arXiv:2301.05832v1 [cs.RO])
    Creating autonomous robots that can actively explore the environment, acquire knowledge and learn skills continuously is the ultimate achievement envisioned in cognitive and developmental robotics. Their learning processes should be based on interactions with their physical and social world in the manner of human learning and cognitive development. Based on this context, in this paper, we focus on the two concepts of world models and predictive coding. Recently, world models have attracted renewed attention as a topic of considerable interest in artificial intelligence. Cognitive systems learn world models to better predict future sensory observations and optimize their policies, i.e., controllers. Alternatively, in neuroscience, predictive coding proposes that the brain continuously predicts its inputs and adapts to model its own dynamics and control behavior in its environment. Both ideas may be considered as underpinning the cognitive development of robots and humans capable of continual or lifelong learning. Although many studies have been conducted on predictive coding in cognitive robotics and neurorobotics, the relationship between world model-based approaches in AI and predictive coding in robotics has rarely been discussed. Therefore, in this paper, we clarify the definitions, relationships, and status of current research on these topics, as well as missing pieces of world models and predictive coding in conjunction with crucially related concepts such as the free-energy principle and active inference in the context of cognitive and developmental robotics. Furthermore, we outline the frontiers and challenges involved in world models and predictive coding toward the further integration of AI and robotics, as well as the creation of robots with real cognitive and developmental capabilities in the future.
    Hand Gesture Recognition through Reflected Infrared Light Wave Signals. (arXiv:2301.05955v1 [eess.SP])
    In this study, we present a wireless (non-contact) gesture recognition method using only incoherent light wave signals reflected from a human subject. In comparison to existing radar, light shadow, sound and camera-based sensing systems, this technology uses a low-cost ubiquitous light source (e.g., infrared LED) to send light towards the subject's hand performing gestures and the reflected light is collected by a light sensor (e.g., photodetector). This light wave sensing system recognizes different gestures from the variations of the received light intensity within a 20-35cm range. The hand gesture recognition results demonstrate up to 96% accuracy on average. The developed system can be utilized in numerous Human-computer Interaction (HCI) applications as a low-cost and non-contact gesture recognition technology.
    Risk-Averse Reinforcement Learning via Dynamic Time-Consistent Risk Measures. (arXiv:2301.05981v1 [cs.LG])
    Traditional reinforcement learning (RL) aims to maximize the expected total reward, while the risk of uncertain outcomes needs to be controlled to ensure reliable performance in a risk-averse setting. In this paper, we consider the problem of maximizing dynamic risk of a sequence of rewards in infinite-horizon Markov Decision Processes (MDPs). We adapt the Expected Conditional Risk Measures (ECRMs) to the infinite-horizon risk-averse MDP and prove its time consistency. Using a convex combination of expectation and conditional value-at-risk (CVaR) as a special one-step conditional risk measure, we reformulate the risk-averse MDP as a risk-neutral counterpart with augmented action space and manipulation on the immediate rewards. We further prove that the related Bellman operator is a contraction mapping, which guarantees the convergence of any value-based RL algorithms. Accordingly, we develop a risk-averse deep Q-learning framework, and our numerical studies based on two simple MDPs show that the risk-averse setting can reduce the variance and enhance robustness of the results.
    Generalized Invariant Matching Property via LASSO. (arXiv:2301.05975v1 [stat.ME])
    Learning under distribution shifts is a challenging task. One principled approach is to exploit the invariance principle via the structural causal models. However, the invariance principle is violated when the response is intervened, making it a difficult setting. In a recent work, the invariant matching property has been developed to shed light on this scenario and shows promising performance. In this work, we generalize the invariant matching property by formulating a high-dimensional problem with intrinsic sparsity. We propose a more robust and computation-efficient algorithm by leveraging a variant of Lasso, improving upon the existing algorithms.
    Adaptive Neural Networks Using Residual Fitting. (arXiv:2301.05744v1 [cs.LG])
    Current methods for estimating the required neural-network size for a given problem class have focused on methods that can be computationally intensive, such as neural-architecture search and pruning. In contrast, methods that add capacity to neural networks as needed may provide similar results to architecture search and pruning, but do not require as much computation to find an appropriate network size. Here, we present a network-growth method that searches for explainable error in the network's residuals and grows the network if sufficient error is detected. We demonstrate this method using examples from classification, imitation learning, and reinforcement learning. Within these tasks, the growing network can often achieve better performance than small networks that do not grow, and similar performance to networks that begin much larger.
    Day-Ahead PV Power Forecasting Based on MSTL-TFT. (arXiv:2301.05911v1 [cs.LG])
    Energy demand is increasing dramatically as global urbanization progresses.Solar energy is a clean energy source with low production and maintenance costs.Accurately predicted PV generation is of great importance for grid integration.Recent day-ahead PV forecasting studies mainly include generation data decomposition, additional meteorological and equipment features, improvement and integration of ANN-based models.We proposed a MSTL-TFT method for day-ahead PV forecasting. The results are better than any of the other studies we have surveyed on day-ahead DKASC PV forecasting.
    Lung airway geometry as an early predictor of autism: A preliminary machine learning-based study. (arXiv:2301.05777v1 [cs.LG])
    The goal of this study is to assess the feasibility of airway geometry as a biomarker for ASD. Chest CT images of children with a documented diagnosis of ASD as well as healthy controls were identified retrospectively. 54 scans were obtained for analysis, including 31 ASD cases and 23 age and sex-matched controls. A feature selection and classification procedure using principal component analysis (PCA) and support vector machine (SVM) achieved a peak cross validation accuracy of nearly 89% using a feature set of 8 airway branching angles. Sensitivity was 94%, but specificity was only 78%. The results suggest a measurable difference in airway branchpoint angles between children with ASD and the control population.
    CrysGNN : Distilling pre-trained knowledge to enhance property prediction for crystalline materials. (arXiv:2301.05852v1 [cs.LG])
    In recent years, graph neural network (GNN) based approaches have emerged as a powerful technique to encode complex topological structure of crystal materials in an enriched representation space. These models are often supervised in nature and using the property-specific training data, learn relationship between crystal structure and different properties like formation energy, bandgap, bulk modulus, etc. Most of these methods require a huge amount of property-tagged data to train the system which may not be available for different properties. However, there is an availability of a huge amount of crystal data with its chemical composition and structural bonds. To leverage these untapped data, this paper presents CrysGNN, a new pre-trained GNN framework for crystalline materials, which captures both node and graph level structural information of crystal graphs using a huge amount of unlabelled material data. Further, we extract distilled knowledge from CrysGNN and inject into different state of the art property predictors to enhance their property prediction accuracy. We conduct extensive experiments to show that with distilled knowledge from the pre-trained model, all the SOTA algorithms are able to outperform their own vanilla version with good margins. We also observe that the distillation process provides a significant improvement over the conventional approach of finetuning the pre-trained model. We have released the pre-trained model along with the large dataset of 800K crystal graph which we carefully curated; so that the pretrained model can be plugged into any existing and upcoming models to enhance their prediction accuracy.
    Discovery of 2D materials using Transformer Network based Generative Design. (arXiv:2301.05824v1 [cond-mat.mtrl-sci])
    Two-dimensional (2D) materials have wide applications in superconductors, quantum, and topological materials. However, their rational design is not well established, and currently less than 6,000 experimentally synthesized 2D materials have been reported. Recently, deep learning, data-mining, and density functional theory (DFT)-based high-throughput calculations are widely performed to discover potential new materials for diverse applications. Here we propose a generative material design pipeline, namely material transformer generator(MTG), for large-scale discovery of hypothetical 2D materials. We train two 2D materials composition generators using self-learning neural language models based on Transformers with and without transfer learning. The models are then used to generate a large number of candidate 2D compositions, which are fed to known 2D materials templates for crystal structure prediction. Next, we performed DFT computations to study their thermodynamic stability based on energy-above-hull and formation energy. We report four new DFT-verified stable 2D materials with zero e-above-hull energies, including NiCl$_4$, IrSBr, CuBr$_3$, and CoBrCl. Our work thus demonstrates the potential of our MTG generative materials design pipeline in the discovery of novel 2D materials and other functional materials.
    A Survey of Self-Supervised Learning from Multiple Perspectives: Algorithms, Theory, Applications and Future Trends. (arXiv:2301.05712v1 [cs.LG])
    Deep supervised learning algorithms generally require large numbers of labeled examples to attain satisfactory performance. To avoid the expensive cost incurred by collecting and labeling too many examples, as a subset of unsupervised learning, self-supervised learning (SSL) was proposed to learn good features from many unlabeled examples without any human-annotated labels. SSL has recently become a hot research topic, and many related algorithms have been proposed. However, few comprehensive studies have explained the connections among different SSL variants and how they have evolved. In this paper, we attempt to provide a review of the various SSL methods from the perspectives of algorithms, theory, applications, three main trends, and open questions. First, the motivations of most SSL algorithms are introduced in detail, and their commonalities and differences are compared. Second, the theoretical issues associated with SSL are investigated. Third, typical applications of SSL in areas such as image processing and computer vision (CV), as well as natural language processing (NLP), are discussed. Finally, the three main trends of SSL and the open research questions are discussed. A collection of useful materials is available at https://github.com/guijiejie/SSL.
    Survey of Knowledge Distillation in Federated Edge Learning. (arXiv:2301.05849v1 [cs.LG])
    The increasing demand for intelligent services and privacy protection of mobile and Internet of Things (IoT) devices motivates the wide application of Federated Edge Learning (FEL), in which devices collaboratively train on-device Machine Learning (ML) models without sharing their private data. \textcolor{black}{Limited by device hardware, diverse user behaviors and network infrastructure, the algorithm design of FEL faces challenges related to resources, personalization and network environments}, and Knowledge Distillation (KD) has been leveraged as an important technique to tackle the above challenges in FEL. In this paper, we investigate the works that KD applies to FEL, discuss the limitations and open problems of existing KD-based FEL approaches, and provide guidance for their real deployment.
    First Three Years of the International Verification of Neural Networks Competition (VNN-COMP). (arXiv:2301.05815v1 [cs.LG])
    This paper presents a summary and meta-analysis of the first three iterations of the annual International Verification of Neural Networks Competition (VNN-COMP) held in 2020, 2021, and 2022. In the VNN-COMP, participants submit software tools that analyze whether given neural networks satisfy specifications describing their input-output behavior. These neural networks and specifications cover a variety of problem classes and tasks, corresponding to safety and robustness properties in image classification, neural control, reinforcement learning, and autonomous systems. We summarize the key processes, rules, and results, present trends observed over the last three years, and provide an outlook into possible future developments.
    A Rigorous Uncertainty-Aware Quantification Framework Is Essential for Reproducible and Replicable Machine Learning Workflows. (arXiv:2301.05763v1 [cs.LG])
    The ability to replicate predictions by machine learning (ML) or artificial intelligence (AI) models and results in scientific workflows that incorporate such ML/AI predictions is driven by numerous factors. An uncertainty-aware metric that can quantitatively assess the reproducibility of quantities of interest (QoI) would contribute to the trustworthiness of results obtained from scientific workflows involving ML/AI models. In this article, we discuss how uncertainty quantification (UQ) in a Bayesian paradigm can provide a general and rigorous framework for quantifying reproducibility for complex scientific workflows. Such as framework has the potential to fill a critical gap that currently exists in ML/AI for scientific workflows, as it will enable researchers to determine the impact of ML/AI model prediction variability on the predictive outcomes of ML/AI-powered workflows. We expect that the envisioned framework will contribute to the design of more reproducible and trustworthy workflows for diverse scientific applications, and ultimately, accelerate scientific discoveries.
    CEDAS: A Compressed Decentralized Stochastic Gradient Method with Improved Convergence. (arXiv:2301.05872v1 [math.OC])
    In this paper, we consider solving the distributed optimization problem over a multi-agent network under the communication restricted setting. We study a compressed decentralized stochastic gradient method, termed ``compressed exact diffusion with adaptive stepsizes (CEDAS)", and show the method asymptotically achieves comparable convergence rate as centralized SGD for both smooth strongly convex objective functions and smooth nonconvex objective functions under unbiased compression operators. In particular, to our knowledge, CEDAS enjoys so far the shortest transient time (with respect to the graph specifics) for achieving the convergence rate of centralized SGD, which behaves as $\mathcal{O}(nC^3/(1-\lambda_2)^{2})$ under smooth strongly convex objective functions, and $\mathcal{O}(n^3C^6/(1-\lambda_2)^4)$ under smooth nonconvex objective functions, where $(1-\lambda_2)$ denotes the spectral gap of the mixing matrix, and $C>0$ is the compression-related parameter. Numerical experiments further demonstrate the effectiveness of the proposed algorithm.
    Poisoning Attacks and Defenses in Federated Learning: A Survey. (arXiv:2301.05795v1 [cs.CR])
    Federated learning (FL) enables the training of models among distributed clients without compromising the privacy of training datasets, while the invisibility of clients datasets and the training process poses a variety of security threats. This survey provides the taxonomy of poisoning attacks and experimental evaluation to discuss the need for robust FL.
    GAR: Generalized Autoregression for Multi-Fidelity Fusion. (arXiv:2301.05729v1 [stat.ML])
    In many scientific research and engineering applications where repeated simulations of complex systems are conducted, a surrogate is commonly adopted to quickly estimate the whole system. To reduce the expensive cost of generating training examples, it has become a promising approach to combine the results of low-fidelity (fast but inaccurate) and high-fidelity (slow but accurate) simulations. Despite the fast developments of multi-fidelity fusion techniques, most existing methods require particular data structures and do not scale well to high-dimensional output. To resolve these issues, we generalize the classic autoregression (AR), which is wildly used due to its simplicity, robustness, accuracy, and tractability, and propose generalized autoregression (GAR) using tensor formulation and latent features. GAR can deal with arbitrary dimensional outputs and arbitrary multifidelity data structure to satisfy the demand of multi-fidelity fusion for complex problems; it admits a fully tractable likelihood and posterior requiring no approximate inference and scales well to high-dimensional problems. Furthermore, we prove the autokrigeability theorem based on GAR in the multi-fidelity case and develop CIGAR, a simplified GAR with the exact predictive mean accuracy with computation reduction by a factor of d 3, where d is the dimensionality of the output. The empirical assessment includes many canonical PDEs and real scientific examples and demonstrates that the proposed method consistently outperforms the SOTA methods with a large margin (up to 6x improvement in RMSE) with only a couple high-fidelity training samples.
    A Comprehensive Survey of Graph-level Learning. (arXiv:2301.05860v1 [cs.LG])
    Graphs have a superior ability to represent relational data, like chemical compounds, proteins, and social networks. Hence, graph-level learning, which takes a set of graphs as input, has been applied to many tasks including comparison, regression, classification, and more. Traditional approaches to learning a set of graphs tend to rely on hand-crafted features, such as substructures. But while these methods benefit from good interpretability, they often suffer from computational bottlenecks as they cannot skirt the graph isomorphism problem. Conversely, deep learning has helped graph-level learning adapt to the growing scale of graphs by extracting features automatically and decoding graphs into low-dimensional representations. As a result, these deep graph learning methods have been responsible for many successes. Yet, there is no comprehensive survey that reviews graph-level learning starting with traditional learning and moving through to the deep learning approaches. This article fills this gap and frames the representative algorithms into a systematic taxonomy covering traditional learning, graph-level deep neural networks, graph-level graph neural networks, and graph pooling. To ensure a thoroughly comprehensive survey, the evolutions, interactions, and communications between methods from four different branches of development are also examined. This is followed by a brief review of the benchmark data sets, evaluation metrics, and common downstream applications. The survey concludes with 13 future directions of necessary research that will help to overcome the challenges facing this booming field.
    Local Model Explanations and Uncertainty Without Model Access. (arXiv:2301.05761v1 [cs.LG])
    We present a model-agnostic algorithm for generating post-hoc explanations and uncertainty intervals for a machine learning model when only a sample of inputs and outputs from the model is available, rather than direct access to the model itself. This situation may arise when model evaluations are expensive; when privacy, security and bandwidth constraints are imposed; or when there is a need for real-time, on-device explanations. Our algorithm constructs explanations using local polynomial regression and quantifies the uncertainty of the explanations using a bootstrapping approach. Through a simulation study, we show that the uncertainty intervals generated by our algorithm exhibit a favorable trade-off between interval width and coverage probability compared to the naive confidence intervals from classical regression analysis. We further demonstrate the capabilities of our method by applying it to black-box models trained on two real datasets.
    Who Should I Trust: AI or Myself? Leveraging Human and AI Correctness Likelihood to Promote Appropriate Trust in AI-Assisted Decision-Making. (arXiv:2301.05809v1 [cs.HC])
    In AI-assisted decision-making, it is critical for human decision-makers to know when to trust AI and when to trust themselves. However, prior studies calibrated human trust only based on AI confidence indicating AI's correctness likelihood (CL) but ignored humans' CL, hindering optimal team decision-making. To mitigate this gap, we proposed to promote humans' appropriate trust based on the CL of both sides at a task-instance level. We first modeled humans' CL by approximating their decision-making models and computing their potential performance in similar instances. We demonstrated the feasibility and effectiveness of our model via two preliminary studies. Then, we proposed three CL exploitation strategies to calibrate users' trust explicitly/implicitly in the AI-assisted decision-making process. Results from a between-subjects experiment (N=293) showed that our CL exploitation strategies promoted more appropriate human trust in AI, compared with only using AI confidence. We further provided practical implications for more human-compatible AI-assisted decision-making.
    Functional Neural Networks: Shift invariant models for functional data with applications to EEG classification. (arXiv:2301.05869v1 [cs.LG])
    It is desirable for statistical models to detect signals of interest independently of their position. If the data is generated by some smooth process, this additional structure should be taken into account. We introduce a new class of neural networks that are shift invariant and preserve smoothness of the data: functional neural networks (FNNs). For this, we use methods from functional data analysis (FDA) to extend multi-layer perceptrons and convolutional neural networks to functional data. We propose different model architectures, show that the models outperform a benchmark model from FDA in terms of accuracy and successfully use FNNs to classify electroencephalography (EEG) data.
    Insights Into Deep Non-linear Filters for Improved Multi-channel Speech Enhancement. (arXiv:2206.13310v3 [eess.AS] UPDATED)
    The key advantage of using multiple microphones for speech enhancement is that spatial filtering can be used to complement the tempo-spectral processing. In a traditional setting, linear spatial filtering (beamforming) and single-channel post-filtering are commonly performed separately. In contrast, there is a trend towards employing deep neural networks (DNNs) to learn a joint spatial and tempo-spectral non-linear filter, which means that the restriction of a linear processing model and that of a separate processing of spatial and tempo-spectral information can potentially be overcome. However, the internal mechanisms that lead to good performance of such data-driven filters for multi-channel speech enhancement are not well understood. Therefore, in this work, we analyse the properties of a non-linear spatial filter realized by a DNN as well as its interdependency with temporal and spectral processing by carefully controlling the information sources (spatial, spectral, and temporal) available to the network. We confirm the superiority of a non-linear spatial processing model, which outperforms an oracle linear spatial filter in a challenging speaker extraction scenario for a low number of microphones by 0.24 POLQA score. Our analyses reveal that in particular spectral information should be processed jointly with spatial information as this increases the spatial selectivity of the filter. Our systematic evaluation then leads to a simple network architecture, that outperforms state-of-the-art network architectures on a speaker extraction task by 0.22 POLQA score and by 0.32 POLQA score on the CHiME3 data.
    Disentangling representations in Restricted Boltzmann Machines without adversaries. (arXiv:2206.11600v3 [cs.LG] UPDATED)
    A goal of unsupervised machine learning is to build representations of complex high-dimensional data, with simple relations to their properties. Such disentangled representations make easier to interpret the significant latent factors of variation in the data, as well as to generate new data with desirable features. Methods for disentangling representations often rely on an adversarial scheme, in which representations are tuned to avoid discriminators from being able to reconstruct information about the data properties (labels). Unfortunately adversarial training is generally difficult to implement in practice. Here we propose a simple, effective way of disentangling representations without any need to train adversarial discriminators, and apply our approach to Restricted Boltzmann Machines (RBM), one of the simplest representation-based generative models. Our approach relies on the introduction of adequate constraints on the weights during training, which allows us to concentrate information about labels on a small subset of latent variables. The effectiveness of the approach is illustrated with four examples: the CelebA dataset of facial images, the two-dimensional Ising model, the MNIST dataset of handwritten digits, and the taxonomy of protein families. In addition, we show how our framework allows for analytically computing the cost, in terms of log-likelihood of the data, associated to the disentanglement of their representations.
    Current Trends in Deep Learning for Earth Observation: An Open-source Benchmark Arena for Image Classification. (arXiv:2207.07189v2 [cs.CV] UPDATED)
    We present AiTLAS: Benchmark Arena -- an open-source benchmark suite for evaluating state-of-the-art deep learning approaches for image classification in Earth Observation (EO). To this end, we present a comprehensive comparative analysis of more than 500 models derived from ten different state-of-the-art architectures and compare them to a variety of multi-class and multi-label classification tasks from 22 datasets with different sizes and properties. In addition to models trained entirely on these datasets, we benchmark models trained in the context of transfer learning, leveraging pre-trained model variants, as it is typically performed in practice. All presented approaches are general and can be easily extended to many other remote sensing image classification tasks not considered in this study. To ensure reproducibility and facilitate better usability and further developments, all of the experimental resources including the trained models, model configurations, and processing details of the datasets (with their corresponding splits used for training and evaluating the models) are publicly available on the repository: https://github.com/biasvariancelabs/aitlas-arena
    Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction. (arXiv:2206.07085v3 [cs.LG] UPDATED)
    Normalization layers (e.g., Batch Normalization, Layer Normalization) were introduced to help with optimization difficulties in very deep nets, but they clearly also help generalization, even in not-so-deep nets. Motivated by the long-held belief that flatter minima lead to better generalization, this paper gives mathematical analysis and supporting experiments suggesting that normalization (together with accompanying weight-decay) encourages GD to reduce the sharpness of loss surface. Here "sharpness" is carefully defined given that the loss is scale-invariant, a known consequence of normalization. Specifically, for a fairly broad class of neural nets with normalization, our theory explains how GD with a finite learning rate enters the so-called Edge of Stability (EoS) regime, and characterizes the trajectory of GD in this regime via a continuous sharpness-reduction flow.
    Evaluating Robustness to Dataset Shift via Parametric Robustness Sets. (arXiv:2205.15947v4 [cs.LG] UPDATED)
    We give a method for proactively identifying small, plausible shifts in distribution which lead to large differences in model performance. These shifts are defined via parametric changes in the causal mechanisms of observed variables, where constraints on parameters yield a "robustness set" of plausible distributions and a corresponding worst-case loss over the set. While the loss under an individual parametric shift can be estimated via reweighting techniques such as importance sampling, the resulting worst-case optimization problem is non-convex, and the estimate may suffer from large variance. For small shifts, however, we can construct a local second-order approximation to the loss under shift and cast the problem of finding a worst-case shift as a particular non-convex quadratic optimization problem, for which efficient algorithms are available. We demonstrate that this second-order approximation can be estimated directly for shifts in conditional exponential family models, and we bound the approximation error. We apply our approach to a computer vision task (classifying gender from images), revealing sensitivity to shifts in non-causal attributes.
    Counterfactual Fairness with Partially Known Causal Graph. (arXiv:2205.13972v3 [cs.LG] UPDATED)
    Fair machine learning aims to avoid treating individuals or sub-populations unfavourably based on \textit{sensitive attributes}, such as gender and race. Those methods in fair machine learning that are built on causal inference ascertain discrimination and bias through causal effects. Though causality-based fair learning is attracting increasing attention, current methods assume the true causal graph is fully known. This paper proposes a general method to achieve the notion of counterfactual fairness when the true causal graph is unknown. To be able to select features that lead to counterfactual fairness, we derive the conditions and algorithms to identify ancestral relations between variables on a \textit{Partially Directed Acyclic Graph (PDAG)}, specifically, a class of causal DAGs that can be learned from observational data combined with domain knowledge. Interestingly, we find that counterfactual fairness can be achieved as if the true causal graph were fully known, when specific background knowledge is provided: the sensitive attributes do not have ancestors in the causal graph. Results on both simulated and real-world datasets demonstrate the effectiveness of our method.
    Random Fully Connected Neural Networks as Perturbatively Solvable Hierarchies. (arXiv:2204.01058v2 [math.PR] UPDATED)
    This article considers fully connected neural networks with Gaussian random weights and biases as well as $L$ hidden layers, each of width proportional to a large parameter $n$. For polynomially bounded non-linearities we give sharp estimates in powers of $1/n$ for the joint cumulants of the network output and its derivatives. Moreover, we show that network cumulants form a perturbatively solvable hierarchy in powers of $1/n$ in that $k$-th order cumulants in one layer have recursions that depend to leading order in $1/n$ only on $j$-th order cumulants at the previous layer with $j\leq k$. By solving a variety of such recursions, however, we find that the depth-to-width ratio $L/n$ plays the role of an effective network depth, controlling both the scale of fluctuations at individual neurons and the size of inter-neuron correlations. Thus, while the cumulant recursions we derive form a hierarchy in powers of $1/n$, contributions of order $1/n^k$ often grow like $L^k$ and are hence non-negligible at positive $L/n$. We use this to study a somewhat simplified version of the exploding and vanishing gradient problem, proving that this particular variant occurs if and only if $L/n$ is large. Several key ideas in this article were first developed at a physics level of rigor in a recent monograph of Daniel A. Roberts, Sho Yaida, and the author. This article not only makes these ideas mathematically precise but also significantly extends them, opening the way to obtaining corrections to all orders in $1/n$.
    Policy Gradients using Variational Quantum Circuits. (arXiv:2203.10591v3 [quant-ph] UPDATED)
    Variational Quantum Circuits are being used as versatile Quantum Machine Learning models. Some empirical results exhibit an advantage in supervised and generative learning tasks. However, when applied to Reinforcement Learning, less is known. In this work, we considered a Variational Quantum Circuit composed of a low-depth hardware-efficient ansatz as the parameterized policy of a Reinforcement Learning agent. We show that an $\epsilon$-approximation of the policy gradient can be obtained using a logarithmic number of samples concerning the total number of parameters. We empirically verify that such quantum models behave similarly or even outperform typical classical neural networks used in standard benchmarking environments and in quantum control, using only a fraction of the parameters. Moreover, we study the Barren Plateau phenomenon in quantum policy gradients using the Fisher Information Matrix spectrum.
    Sharing to learn and learning to share -- Fitting together Meta-Learning, Multi-Task Learning, and Transfer Learning: A meta review. (arXiv:2111.12146v5 [cs.LG] UPDATED)
    Integrating knowledge across different domains is an essential feature of human learning. Learning paradigms such as transfer learning, meta learning, and multi-task learning reflect the human learning process by exploiting the prior knowledge for new tasks, encouraging faster learning and good generalization for new tasks. This article gives a detailed view of these learning paradigms and their comparative analysis. The weakness of one learning algorithm turns out to be a strength of another, and thus merging them is a prevalent trait in the literature. There are numerous research papers that focus on each of these learning paradigms separately and provide a comprehensive overview of them. However, this article provides a review of research studies that combine (two of) these learning algorithms. This survey describes how these techniques are combined to solve problems in many different fields of study, including computer vision, natural language processing, hyperspectral imaging, and many more, in supervised setting only. As a result, the global generic learning network an amalgamation of meta learning, transfer learning, and multi-task learning is introduced here, along with some open research questions and future research directions in the multi-task setting.
    Privatized Graph Federated Learning. (arXiv:2203.07105v2 [cs.LG] UPDATED)
    Federated learning is a semi-distributed algorithm, where a server communicates with multiple dispersed clients to learn a global model. The federated architecture is not robust and is sensitive to communication and computational overloads due to its one-master multi-client structure. It can also be subject to privacy attacks targeting personal information on the communication links. In this work, we introduce graph federated learning (GFL), which consists of multiple federated units connected by a graph. We then show how graph homomorphic perturbations can be used to ensure the algorithm is differentially private. We conduct both convergence and privacy theoretical analyses and illustrate performance by means of computer simulations.
    Learning Partial Equivariances from Data. (arXiv:2110.10211v3 [cs.CV] UPDATED)
    Group Convolutional Neural Networks (G-CNNs) constrain learned features to respect the symmetries in the selected group, and lead to better generalization when these symmetries appear in the data. If this is not the case, however, equivariance leads to overly constrained models and worse performance. Frequently, transformations occurring in data can be better represented by a subset of a group than by a group as a whole, e.g., rotations in $[-90^{\circ}, 90^{\circ}]$. In such cases, a model that respects equivariance $\textit{partially}$ is better suited to represent the data. In addition, relevant transformations may differ for low and high-level features. For instance, full rotation equivariance is useful to describe edge orientations in a face, but partial rotation equivariance is better suited to describe face poses relative to the camera. In other words, the optimal level of equivariance may differ per layer. In this work, we introduce $\textit{Partial G-CNNs}$: G-CNNs able to learn layer-wise levels of partial and full equivariance to discrete, continuous groups and combinations thereof as part of training. Partial G-CNNs retain full equivariance when beneficial, e.g., for rotated MNIST, but adjust it whenever it becomes harmful, e.g., for classification of 6 / 9 digits or natural images. We empirically show that partial G-CNNs pair G-CNNs when full equivariance is advantageous, and outperform them otherwise.
    Variational Actor-Critic Algorithms. (arXiv:2108.01215v4 [cs.LG] UPDATED)
    We introduce a class of variational actor-critic algorithms based on a variational formulation over both the value function and the policy. The objective function of the variational formulation consists of two parts: one for maximizing the value function and the other for minimizing the Bellman residual. Besides the vanilla gradient descent with both the value function and the policy updates, we propose two variants, the clipping method and the flipping method, in order to speed up the convergence. We also prove that, when the prefactor of the Bellman residual is sufficiently large, the fixed point of the algorithm is close to the optimal policy.
    Smart Choices and the Selection Monad. (arXiv:2007.08926v7 [cs.LO] UPDATED)
    Describing systems in terms of choices and their resulting costs and rewards offers the promise of freeing algorithm designers and programmers from specifying how those choices should be made; in implementations, the choices can be realized by optimization techniques and, increasingly, by machine-learning methods. We study this approach from a programming-language perspective. We define two small languages that support decision-making abstractions: one with choices and rewards, and the other additionally with probabilities. We give both operational and denotational semantics. In the case of the second language we consider three denotational semantics, with varying degrees of correlation between possible program values and expected rewards. The operational semantics combine the usual semantics of standard constructs with optimization over spaces of possible execution strategies. The denotational semantics, which are compositional, rely on the selection monad, to handle choice, augmented with an auxiliary monad to handle other effects, such as rewards or probability. We establish adequacy theorems that the two semantics coincide in all cases. We also prove full abstraction at base types, with varying notions of observation in the probabilistic case corresponding to the various degrees of correlation. We present axioms for choice combined with rewards and probability, establishing completeness at base types for the case of rewards without probability.
    Elastic Similarity and Distance Measures for Multivariate Time Series. (arXiv:2102.10231v2 [cs.LG] UPDATED)
    This paper contributes multivariate versions of seven commonly used elastic similarity and distance measures for time series data analytics. Elastic similarity and distance measures are a class of similarity measures that can compensate for misalignments in the time axis of time series data. We adapt two existing strategies used in a multivariate version of the well-known Dynamic Time Warping (DTW), namely, Independent and Dependent DTW, to these seven measures. While these measures can be applied to various time series analysis tasks, we demonstrate their utility on multivariate time series classification using the nearest neighbor classifier. On 23 well-known datasets, we demonstrate that each of the measures but one achieves the highest accuracy relative to others on at least one dataset, supporting the value of developing a suite of multivariate similarity and distance measures. We also demonstrate that there are datasets for which either the dependent versions of all measures are more accurate than their independent counterparts or vice versa. In addition, we also construct a nearest neighbor-based ensemble of the measures and show that it is competitive to other state-of-the-art single-strategy multivariate time series classifiers.
    SRECG: ECG Signal Super-resolution Framework for Portable/Wearable Devices in Cardiac Arrhythmias Classification. (arXiv:2012.03803v2 [eess.SP] UPDATED)
    A combination of cloud-based deep learning (DL) algorithms with portable/wearable (P/W) devices has been developed as a smart heath care system to support automatic cardiac arrhythmias (CAs) classification using electrocardiography (ECG). However, long-term and continuous ECG monitoring is challenging because of limitations of batteries and transmission bandwidth of P/W devices while incorporated with consumer electronics (CE). A feasible approach to address this challenge is to decrease sampling rates. However, low sampling rates lead to low-resolution signals that hinder the CAs classification performance. In this study, we propose a DL-based ECG signal super-resolution framework (called SRECG) to enhance low-resolution ECG signals by jointly considering the accuracies when applied to the DL-based high-resolution multiclass classifier (HMC) of CAs. In our experiments, we downsampled the ECG signals from the CPSC2018 dataset and evaluated their HMC accuracies with and without the SRECG. Experimental results show that SRECG can well improve the HMC accuracies as compared to traditional interpolation methods. Moreover, approximately half of the CAs classification accuracies of HMC were maintained within the enhanced ECG signals by SRECG. The promising results confirm that SRECG can be suitably used to enhance low-resolution ECG signals from P/W devices with CE to improve their cloud-based HMC performances.
    Dynamically Mitigating Data Discrepancy with Balanced Focal Loss for Replay Attack Detection. (arXiv:2006.14563v2 [cs.CV] UPDATED)
    It becomes urgent to design effective anti-spoofing algorithms for vulnerable automatic speaker verification systems due to the advancement of high-quality playback devices. Current studies mainly treat anti-spoofing as a binary classification problem between bonafide and spoofed utterances, while lack of indistinguishable samples makes it difficult to train a robust spoofing detector. In this paper, we argue that for anti-spoofing, it needs more attention for indistinguishable samples over easily-classified ones in the modeling process, to make correct discrimination a top priority. Therefore, to mitigate the data discrepancy between training and inference, we propose D3M, to leverage a balanced focal loss function as the training objective to dynamically scale the loss based on the traits of the sample itself. Besides, in the experiments, we select three kinds of features that contain both magnitude-based and phase-based information to form complementary and informative features. Experimental results on the ASVspoof2019 dataset demonstrate the superiority of the proposed methods by comparison between our systems and top-performing ones. Systems trained with the balanced focal loss perform significantly better than conventional cross-entropy loss. With complementary features, our fusion system with only three kinds of features outperforms other systems containing five or more complex single models by 22.5% for min-tDCF and 7% for EER, achieving a min-tDCF and an EER of 0.0124 and 0.55% respectively. Furthermore, we present and discuss the evaluation results on real replay data apart from the simulated ASVspoof2019 data, indicating that research for anti-spoofing still has a long way to go. Source code, analysis data, and other details are publicly available at $\href{https://github.com/asvspoof/D3M}{\text{https://github.com/asvspoof/D3M}}$.
    Efficient anomaly detection method for rooftop PV systems using big data and permutation entropy. (arXiv:2301.06035v1 [cs.LG])
    The number of rooftop photovoltaic (PV) systems has significantly increased in recent years around the globe, including in Australia. This trend is anticipated to continue in the next few years. Given their high share of generation in power systems, detecting malfunctions and abnormalities in rooftop PV systems is essential for ensuring their high efficiency and safety. In this paper, we present a novel anomaly detection method for a large number of rooftop PV systems installed in a region using big data and a time series complexity measure called weighted permutation entropy (WPE). This efficient method only uses the historical PV generation data in a given region to identify anomalous PV systems and requires no new sensor or smart device. Using a real-world PV generation dataset, we discuss how the hyperparameters of WPE should be tuned for the purpose. The proposed PV anomaly detection method is then tested on rooftop PV generation data from over 100 South Australian households. The results demonstrate that anomalous systems detected by our method have indeed encountered problems and require a close inspection. The detection and resolution of potential faults would result in better rooftop PV systems, longer lifetimes, and higher returns on investment.
    Hawk: An Industrial-strength Multi-label Document Classifier. (arXiv:2301.06057v1 [cs.CL])
    There are a plethora of methods and algorithms that solve the classical multi-label document classification. However, when it comes to deployment and usage in an industry setting, most, if not all the contemporary approaches fail to address some of the vital aspects or requirements of an ideal solution: i. ability to operate on variable-length texts and rambling documents. ii. catastrophic forgetting problem. iii. modularity when it comes to online learning and updating the model. iv. ability to spotlight relevant text while producing the prediction, i.e. visualizing the predictions. v. ability to operate on imbalanced or skewed datasets. vi. scalability. The paper describes the significance of these problems in detail and proposes a unique neural network architecture that addresses the above problems. The proposed architecture views documents as a sequence of sentences and leverages sentence-level embeddings for input representation. A hydranet-like architecture is designed to have granular control over and improve the modularity, coupled with a weighted loss driving task-specific heads. In particular, two specific mechanisms are compared: Bi-LSTM and Transformer-based. The architecture is benchmarked on some of the popular benchmarking datasets such as Web of Science - 5763, Web of Science - 11967, BBC Sports, and BBC News datasets. The experimental results reveal that the proposed model outperforms the existing methods by a substantial margin. The ablation study includes comparisons of the impact of the attention mechanism and the application of weighted loss functions to train the task-specific heads in the hydranet.
    Deep Learning Provides Rapid Screen for Breast Cancer Metastasis with Sentinel Lymph Nodes. (arXiv:2301.05938v1 [cs.CV])
    Deep learning has been shown to be useful to detect breast cancer metastases by analyzing whole slide images of sentinel lymph nodes. However, it requires extensive scanning and analysis of all the lymph nodes slides for each case. Our deep learning study focuses on breast cancer screening with only a small set of image patches from any sentinel lymph node, positive or negative for metastasis, to detect changes in tumor environment and not in the tumor itself. We design a convolutional neural network in the Python language to build a diagnostic model for this purpose. The excellent results from this preliminary study provided a proof of concept for incorporating automated metastatic screen into the digital pathology workflow to augment the pathologists' productivity. Our approach is unique since it provides a very rapid screen rather than an exhaustive search for tumor in all fields of all sentinel lymph nodes.
    Efficient and Accurate Estimation of Lipschitz Constants for Deep Neural Networks. (arXiv:1906.04893v2 [cs.LG] UPDATED)
    Tight estimation of the Lipschitz constant for deep neural networks (DNNs) is useful in many applications ranging from robustness certification of classifiers to stability analysis of closed-loop systems with reinforcement learning controllers. Existing methods in the literature for estimating the Lipschitz constant suffer from either lack of accuracy or poor scalability. In this paper, we present a convex optimization framework to compute guaranteed upper bounds on the Lipschitz constant of DNNs both accurately and efficiently. Our main idea is to interpret activation functions as gradients of convex potential functions. Hence, they satisfy certain properties that can be described by quadratic constraints. This particular description allows us to pose the Lipschitz constant estimation problem as a semidefinite program (SDP). The resulting SDP can be adapted to increase either the estimation accuracy (by capturing the interaction between activation functions of different layers) or scalability (by decomposition and parallel implementation). We illustrate the utility of our approach with a variety of experiments on randomly generated networks and on classifiers trained on the MNIST and Iris datasets. In particular, we experimentally demonstrate that our Lipschitz bounds are the most accurate compared to those in the literature. We also study the impact of adversarial training methods on the Lipschitz bounds of the resulting classifiers and show that our bounds can be used to efficiently provide robustness guarantees.
    Compress Then Test: Powerful Kernel Testing in Near-linear Time. (arXiv:2301.05974v1 [stat.ML])
    Kernel two-sample testing provides a powerful framework for distinguishing any pair of distributions based on $n$ sample points. However, existing kernel tests either run in $n^2$ time or sacrifice undue power to improve runtime. To address these shortcomings, we introduce Compress Then Test (CTT), a new framework for high-powered kernel testing based on sample compression. CTT cheaply approximates an expensive test by compressing each $n$ point sample into a small but provably high-fidelity coreset. For standard kernels and subexponential distributions, CTT inherits the statistical behavior of a quadratic-time test -- recovering the same optimal detection boundary -- while running in near-linear time. We couple these advances with cheaper permutation testing, justified by new power analyses; improved time-vs.-quality guarantees for low-rank approximation; and a fast aggregation procedure for identifying especially discriminating kernels. In our experiments with real and simulated data, CTT and its extensions provide 20--200x speed-ups over state-of-the-art approximate MMD tests with no loss of power.
    A data science and machine learning approach to continuous analysis of Shakespeare's plays. (arXiv:2301.06024v1 [cs.CL])
    The availability of quantitative methods that can analyze text has provided new ways of examining literature in a manner that was not available in the pre-information era. Here we apply comprehensive machine learning analysis to the work of William Shakespeare. The analysis shows clear change in style of writing over time, with the most significant changes in the sentence length, frequency of adjectives and adverbs, and the sentiments expressed in the text. Applying machine learning to make a stylometric prediction of the year of the play shows a Pearson correlation of 0.71 between the actual and predicted year, indicating that Shakespeare's writing style as reflected by the quantitative measurements changed over time. Additionally, it shows that the stylometrics of some of the plays is more similar to plays written either before or after the year they were written. For instance, Romeo and Juliet is dated 1596, but is more similar in stylometrics to plays written by Shakespeare after 1600. The source code for the analysis is available for free download.
    Transferring Fairness under Distribution Shifts via Fair Consistency Regularization. (arXiv:2206.12796v3 [cs.LG] UPDATED)
    The increasing reliance on ML models in high-stakes tasks has raised a major concern on fairness violations. Although there has been a surge of work that improves algorithmic fairness, most of them are under the assumption of an identical training and test distribution. In many real-world applications, however, such an assumption is often violated as previously trained fair models are often deployed in a different environment, and the fairness of such models has been observed to collapse. In this paper, we study how to transfer model fairness under distribution shifts, a widespread issue in practice. We conduct a fine-grained analysis of how the fair model is affected under different types of distribution shifts and find that domain shifts are more challenging than subpopulation shifts. Inspired by the success of self-training in transferring accuracy under domain shifts, we derive a sufficient condition for transferring group fairness. Guided by it, we propose a practical algorithm with a fair consistency regularization as the key component. A synthetic dataset benchmark, which covers all types of distribution shifts, is deployed for experimental verification of the theoretical findings. Experiments on synthetic and real datasets including image and tabular data demonstrate that our approach effectively transfers fairness and accuracy under various distribution shifts.
    Evaluating the Spectral Bias of Coordinate Based MLPs. (arXiv:2301.05816v1 [cs.LG])
    In recent years, representations given by fully connected neural networks have shown to represent scenes, objects, and other measurements well in dense low-dimensional settings. For these models, termed coordinate based MLPs, sinusoidal encodings are necessary in allowing for convergence to the high frequency components of the target function. This requirement is a result of their severe spectral bias when using dense, low dimensional coordinate based inputs. Previous work explained this phenomena using Neural Tangent Kernel (NTK) and Fourier analysis. While these methods provide insight towards this large spectral bias and the benefits of positional encoding, the properties of ReLU networks that induce this behavior are not fully determined. Analyzing spectral bias directly through the computations of ReLU networks would expose their limitations in dense settings, while providing a clearer explanation as to how this behavior emerges during the learning process. In this paper, we systematically analyze the spectral bias of a coordinate based MLP through its activation regions and gradient descent dynamics. This allows us to relate the network's expressive capacity to the speed at which gradient descent converges for components of varying frequency, and how the density of the data further restricts the model.
    What Can Transformers Learn In-Context? A Case Study of Simple Function Classes. (arXiv:2208.01066v2 [cs.CL] UPDATED)
    In-context learning refers to the ability of a model to condition on a prompt sequence consisting of in-context examples (input-output pairs corresponding to some task) along with a new query input, and generate the corresponding output. Crucially, in-context learning happens only at inference time without any parameter updates to the model. While large language models such as GPT-3 exhibit some ability to perform in-context learning, it is unclear what the relationship is between tasks on which this succeeds and what is present in the training data. To make progress towards understanding in-context learning, we consider the well-defined problem of training a model to in-context learn a function class (e.g., linear functions): that is, given data derived from some functions in the class, can we train a model to in-context learn "most" functions from this class? We show empirically that standard Transformers can be trained from scratch to perform in-context learning of linear functions -- that is, the trained model is able to learn unseen linear functions from in-context examples with performance comparable to the optimal least squares estimator. In fact, in-context learning is possible even under two forms of distribution shift: (i) between the training data of the model and inference-time prompts, and (ii) between the in-context examples and the query input during inference. We also show that we can train Transformers to in-context learn more complex function classes -- namely sparse linear functions, two-layer neural networks, and decision trees -- with performance that matches or exceeds task-specific learning algorithms. Our code and models are available at https://github.com/dtsip/in-context-learning .
    Reinforcement learning on graphs: A survey. (arXiv:2204.06127v4 [cs.LG] UPDATED)
    Graph mining tasks arise from many different application domains, ranging from social networks, transportation to E-commerce, etc., which have been receiving great attention from the theoretical and algorithmic design communities in recent years, and there has been some pioneering work employing the research-rich Reinforcement Learning (RL) techniques to address graph data mining tasks. However, these graph mining methods and RL models are dispersed in different research areas, which makes it hard to compare them. In this survey, we provide a comprehensive overview of RL and graph mining methods and generalize these methods to Graph Reinforcement Learning (GRL) as a unified formulation. We further discuss the applications of GRL methods across various domains and summarize the method descriptions, open-source codes, and benchmark datasets of GRL methods. Furthermore, we propose important directions and challenges to be solved in the future. As far as we know, this is the latest work on a comprehensive survey of GRL, this work provides a global view and a learning resource for scholars. In addition, we create an online open-source for both interested scholars who want to enter this rapidly developing domain and experts who would like to compare GRL methods.
    K-Deep Simplex: Deep Manifold Learning via Local Dictionaries. (arXiv:2012.02134v3 [cs.LG] UPDATED)
    We propose K-Deep Simplex (KDS) which, given a set of data points, learns a dictionary comprising synthetic landmarks, along with representation coefficients supported on a simplex. KDS integrates manifold learning and sparse coding/dictionary learning: reconstruction term, as in classical dictionary learning, and a novel local weighted $\ell_1$ penalty that encourages each data point to represent itself as a convex combination of nearby landmarks. We solve the proposed optimization program using alternating minimization and design an efficient, interpretable autoencoder using algorithm enrolling. We theoretically analyze the proposed program by relating the weighted $\ell_1$ penalty in KDS to a weighted $\ell_0$ program. Assuming that the data are generated from a Delaunay triangulation, we prove the equivalence of the weighted $\ell_1$ and weighted $\ell_0$ programs. If the representation coefficients are given, we prove that the resulting dictionary is unique. Further, we show that low-dimensional representations can be efficiently obtained from the covariance of the coefficient matrix. We apply KDS to the unsupervised clustering problem and prove theoretical performance guarantees. Experiments show that the algorithm is highly efficient and performs competitively on synthetic and real data sets.
    Salient Sign Detection In Safe Autonomous Driving: AI Which Reasons Over Full Visual Context. (arXiv:2301.05804v1 [cs.CV])
    Detecting road traffic signs and accurately determining how they can affect the driver's future actions is a critical task for safe autonomous driving systems. However, various traffic signs in a driving scene have an unequal impact on the driver's decisions, making detecting the salient traffic signs a more important task. Our research addresses this issue, constructing a traffic sign detection model which emphasizes performance on salient signs, or signs that influence the decisions of a driver. We define a traffic sign salience property and use it to construct the LAVA Salient Signs Dataset, the first traffic sign dataset that includes an annotated salience property. Next, we use a custom salience loss function, Salience-Sensitive Focal Loss, to train a Deformable DETR object detection model in order to emphasize stronger performance on salient signs. Results show that a model trained with Salience-Sensitive Focal Loss outperforms a model trained without, with regards to recall of both salient signs and all signs combined. Further, the performance margin on salient signs compared to all signs is largest for the model trained with Salience-Sensitive Focal Loss.  ( 2 min )
    MLOps: A Primer for Policymakers on a New Frontier in Machine Learning. (arXiv:2301.05775v1 [cs.LG])
    This chapter is written with the Data Scientist or MLOps professional in mind but can be used as a resource for policy makers, reformists, AI Ethicists, sociologists, and others interested in finding methods that help reduce bias in algorithms. I will take a deployment centered approach with the assumption that the professionals reading this work have already read the amazing work on the implications of algorithms on historically marginalized groups by Gebru, Buolamwini, Benjamin and Shane to name a few. If you have not read those works, I refer you to the "Important Reading for Ethical Model Building" list at the end of this paper as it will help give you a framework on how to think about Machine Learning models more holistically taking into account their effect on marginalized people. In the Introduction to this chapter, I root the significance of their work in real world examples of what happens when models are deployed without transparent data collected for the training process and are deployed without the practitioners paying special attention to what happens to models that adapt to exploit gaps between their training environment and the real world. The rest of this chapter builds on the work of the aforementioned researchers and discusses the reality of models performing post production and details ways ML practitioners can identify bias using tools during the MLOps lifecycle to mitigate bias that may be introduced to models in the real world.  ( 2 min )
    ML Approach for Power Consumption Prediction in Virtualized Base Stations. (arXiv:2301.05764v1 [cs.LG])
    The flexibility introduced with the Open Radio Access Network (O-RAN) architecture allows us to think beyond static configurations in all parts of the network. This paper addresses the issue related to predicting the power consumption of different radio schedulers, and the potential offered by O-RAN to collect data, train models, and deploy policies to control the power consumption. We propose a black-box (Neural Network) model to learn the power consumption function. We compare our approach with a known hand-crafted solution based on domain knowledge. Our solution reaches similar performance without any previous knowledge of the application and provides more flexibility in scenarios where the system behavior is not well understood or the domain knowledge is not available.  ( 2 min )
    FedSSC: Shared Supervised-Contrastive Federated Learning. (arXiv:2301.05797v1 [cs.LG])
    Federated learning is widely used to perform decentralized training of a global model on multiple devices while preserving the data privacy of each device. However, it suffers from heterogeneous local data on each training device which increases the difficulty to reach the same level of accuracy as the centralized training. Supervised Contrastive Learning which outperform cross-entropy tries to minimizes the difference between feature space of points belongs to the same class and pushes away points from different classes. We propose Supervised Contrastive Federated Learning in which devices can share the learned class-wise feature spaces with each other and add the supervised-contrastive learning loss as a regularization term to foster the feature space learning. The loss tries to minimize the cosine similarity distance between the feature map and the averaged feature map from another device in the same class and maximizes the distance between the feature map and that in a different class. This new regularization term when added on top of the moon regularization term is found to outperform the other state-of-the-art regularization terms in solving the heterogeneous data distribution problem.  ( 2 min )
    A domain-decomposed VAE method for Bayesian inverse problems. (arXiv:2301.05708v1 [stat.ML])
    Bayesian inverse problems are often computationally challenging when the forward model is governed by complex partial differential equations (PDEs). This is typically caused by expensive forward model evaluations and high-dimensional parameterization of priors. This paper proposes a domain-decomposed variational auto-encoder Markov chain Monte Carlo (DD-VAE-MCMC) method to tackle these challenges simultaneously. Through partitioning the global physical domain into small subdomains, the proposed method first constructs local deterministic generative models based on local historical data, which provide efficient local prior representations. Gaussian process models with active learning address the domain decomposition interface conditions. Then inversions are conducted on each subdomain independently in parallel and in low-dimensional latent parameter spaces. The local inference solutions are post-processed through the Poisson image blending procedure to result in an efficient global inference result. Numerical examples are provided to demonstrate the performance of the proposed method.  ( 2 min )
    Eco-PiNN: A Physics-informed Neural Network for Eco-toll Estimation. (arXiv:2301.05739v1 [cs.LG])
    The eco-toll estimation problem quantifies the expected environmental cost (e.g., energy consumption, exhaust emissions) for a vehicle to travel along a path. This problem is important for societal applications such as eco-routing, which aims to find paths with the lowest exhaust emissions or energy need. The challenges of this problem are three-fold: (1) the dependence of a vehicle's eco-toll on its physical parameters; (2) the lack of access to data with eco-toll information; and (3) the influence of contextual information (i.e. the connections of adjacent segments in the path) on the eco-toll of road segments. Prior work on eco-toll estimation has mostly relied on pure data-driven approaches and has high estimation errors given the limited training data. To address these limitations, we propose a novel Eco-toll estimation Physics-informed Neural Network framework (Eco-PiNN) using three novel ideas, namely, (1) a physics-informed decoder that integrates the physical laws of the vehicle engine into the network, (2) an attention-based contextual information encoder, and (3) a physics-informed regularization to reduce overfitting. Experiments on real-world heavy-duty truck data show that the proposed method can greatly improve the accuracy of eco-toll estimation compared with state-of-the-art methods.  ( 2 min )
    Diatom-inspired architected materials using language-based deep learning: Perception, transformation and manufacturing. (arXiv:2301.05875v1 [cond-mat.mtrl-sci])
    Learning from nature has been a quest of humanity for millennia. While this has taken the form of humans assessing natural designs such as bones, butterfly wings, or spider webs, we can now achieve generating designs using advanced computational algorithms. In this paper we report novel biologically inspired designs of diatom structures, enabled using transformer neural networks, using natural language models to learn, process and transfer insights across manifestations. We illustrate a series of novel diatom-based designs and also report a manufactured specimen, created using additive manufacturing. The method applied here could be expanded to focus on other biological design cues, implement a systematic optimization to meet certain design targets, and include a hybrid set of material design sets.  ( 2 min )
    Artificial Benchmark for Community Detection with Outliers (ABCD+o). (arXiv:2301.05749v1 [cs.SI])
    The Artificial Benchmark for Community Detection graph (ABCD) is a random graph model with community structure and power-law distribution for both degrees and community sizes. The model generates graphs with similar properties as the well-known LFR one, and its main parameter $\xi$ can be tuned to mimic its counterpart in the LFR model, the mixing parameter $\mu$. In this paper, we extend the ABCD model to include potential outliers. We perform some exploratory experiments on both the new ABCD+o model as well as a real-world network to show that outliers possess some desired, distinguishable properties.  ( 2 min )
    Efficient Activation Function Optimization through Surrogate Modeling. (arXiv:2301.05785v1 [cs.LG])
    Carefully designed activation functions can improve the performance of neural networks in many machine learning tasks. However, it is difficult for humans to construct optimal activation functions, and current activation function search algorithms are prohibitively expensive. This paper aims to improve the state of the art through three steps: First, the benchmark datasets Act-Bench-CNN, Act-Bench-ResNet, and Act-Bench-ViT were created by training convolutional, residual, and vision transformer architectures from scratch with 2,913 systematically generated activation functions. Second, a characterization of the benchmark space was developed, leading to a new surrogate-based method for optimization. More specifically, the spectrum of the Fisher information matrix associated with the model's predictive distribution at initialization and the activation function's output distribution were found to be highly predictive of performance. Third, the surrogate was used to discover improved activation functions in CIFAR-100 and ImageNet tasks. Each of these steps is a contribution in its own right; together they serve as a practical and theoretical foundation for further research on activation function optimization. Code is available at https://github.com/cognizant-ai-labs/aquasurf, and the benchmark datasets are at https://github.com/cognizant-ai-labs/act-bench.  ( 2 min )
    Fairness and Sequential Decision Making: Limits, Lessons, and Opportunities. (arXiv:2301.05753v1 [cs.CY])
    As automated decision making and decision assistance systems become common in everyday life, research on the prevention or mitigation of potential harms that arise from decisions made by these systems has proliferated. However, various research communities have independently conceptualized these harms, envisioned potential applications, and proposed interventions. The result is a somewhat fractured landscape of literature focused generally on ensuring decision-making algorithms "do the right thing". In this paper, we compare and discuss work across two major subsets of this literature: algorithmic fairness, which focuses primarily on predictive systems, and ethical decision making, which focuses primarily on sequential decision making and planning. We explore how each of these settings has articulated its normative concerns, the viability of different techniques for these different settings, and how ideas from each setting may have utility for the other.  ( 2 min )
  • Open

    Joint Entropy Search for Maximally-Informed Bayesian Optimization. (arXiv:2206.04771v5 [cs.LG] UPDATED)
    Information-theoretic Bayesian optimization techniques have become popular for optimizing expensive-to-evaluate black-box functions due to their non-myopic qualities. Entropy Search and Predictive Entropy Search both consider the entropy over the optimum in the input space, while the recent Max-value Entropy Search considers the entropy over the optimal value in the output space. We propose Joint Entropy Search (JES), a novel information-theoretic acquisition function that considers an entirely new quantity, namely the entropy over the joint optimal probability density over both input and output space. To incorporate this information, we consider the reduction in entropy from conditioning on fantasized optimal input/output pairs. The resulting approach primarily relies on standard GP machinery and removes complex approximations typically associated with information-theoretic methods. With minimal computational overhead, JES shows superior decision-making, and yields state-of-the-art performance for information-theoretic approaches across a wide suite of tasks. As a light-weight approach with superior results, JES provides a new go-to acquisition function for Bayesian optimization.  ( 2 min )
    Unbalanced Optimal Transport, from Theory to Numerics. (arXiv:2211.08775v2 [stat.ML] UPDATED)
    Optimal Transport (OT) has recently emerged as a central tool in data sciences to compare in a geometrically faithful way point clouds and more generally probability distributions. The wide adoption of OT into existing data analysis and machine learning pipelines is however plagued by several shortcomings. This includes its lack of robustness to outliers, its high computational costs, the need for a large number of samples in high dimension and the difficulty to handle data in distinct spaces. In this review, we detail several recently proposed approaches to mitigate these issues. We insist in particular on unbalanced OT, which compares arbitrary positive measures, not restricted to probability distributions (i.e. their total mass can vary). This generalization of OT makes it robust to outliers and missing data. The second workhorse of modern computational OT is entropic regularization, which leads to scalable algorithms while lowering the sample complexity in high dimension. The last point presented in this review is the Gromov-Wasserstein (GW) distance, which extends OT to cope with distributions belonging to different metric spaces. The main motivation for this review is to explain how unbalanced OT, entropic regularization and GW can work hand-in-hand to turn OT into efficient geometric loss functions for data sciences.  ( 2 min )
    Primal Dual Alternating Proximal Gradient Algorithms for Nonsmooth Nonconvex Minimax Problems with Coupled Linear Constraints. (arXiv:2212.04672v2 [math.OC] UPDATED)
    Nonconvex minimax problems have attracted wide attention in machine learning, signal processing and many other fields in recent years. In this paper, we propose a primal dual alternating proximal gradient (PDAPG) algorithm and a primal dual proximal gradient (PDPG-L) algorithm for solving nonsmooth nonconvex-(strongly) concave and nonconvex-linear minimax problems with coupled linear constraints, respectively. The iteration complexity of the two algorithms are proved to be $\mathcal{O}\left( \varepsilon ^{-2} \right)$ (resp. $\mathcal{O}\left( \varepsilon ^{-4} \right)$) under nonconvex-strongly concave (resp. nonconvex-concave) setting and $\mathcal{O}\left( \varepsilon ^{-3} \right)$ under nonconvex-linear setting to reach an $\varepsilon$-stationary point, respectively. To our knowledge, they are the first two algorithms with iteration complexity guarantee for solving the nonconvex minimax problems with coupled linear constraints.  ( 2 min )
    Data-Efficient Pipeline for Offline Reinforcement Learning with Limited Data. (arXiv:2210.08642v2 [cs.LG] UPDATED)
    Offline reinforcement learning (RL) can be used to improve future performance by leveraging historical data. There exist many different algorithms for offline RL, and it is well recognized that these algorithms, and their hyperparameter settings, can lead to decision policies with substantially differing performance. This prompts the need for pipelines that allow practitioners to systematically perform algorithm-hyperparameter selection for their setting. Critically, in most real-world settings, this pipeline must only involve the use of historical data. Inspired by statistical model selection methods for supervised learning, we introduce a task- and method-agnostic pipeline for automatically training, comparing, selecting, and deploying the best policy when the provided dataset is limited in size. In particular, our work highlights the importance of performing multiple data splits to produce more reliable algorithm-hyperparameter selection. While this is a common approach in supervised learning, to our knowledge, this has not been discussed in detail in the offline RL setting. We show it can have substantial impacts when the dataset is small. Compared to alternate approaches, our proposed pipeline outputs higher-performing deployed policies from a broad range of offline policy learning algorithms and across various simulation domains in healthcare, education, and robotics. This work contributes toward the development of a general-purpose meta-algorithm for automatic algorithm-hyperparameter selection for offline RL.  ( 2 min )
    Linear Convergence of ISTA and FISTA. (arXiv:2212.06319v2 [math.OC] UPDATED)
    In this paper, we revisit the class of iterative shrinkage-thresholding algorithms (ISTA) for solving the linear inverse problem with sparse representation, which arises in signal and image processing. It is shown in the numerical experiment to deblur an image that the convergence behavior in the logarithmic-scale ordinate tends to be linear instead of logarithmic, approximating to be flat. Making meticulous observations, we find that the previous assumption for the smooth part to be convex weakens the least-square model. Specifically, assuming the smooth part to be strongly convex is more reasonable for the least-square model, even though the image matrix is probably ill-conditioned. Furthermore, we improve the pivotal inequality tighter for composite optimization with the smooth part to be strongly convex instead of general convex, which is first found in [Li et al., 2022]. Based on this pivotal inequality, we generalize the linear convergence to composite optimization in both the objective value and the squared proximal subgradient norm. Meanwhile, we set a simple ill-conditioned matrix which is easy to compute the singular values instead of the original blur matrix. The new numerical experiment shows the proximal generalization of Nesterov's accelerated gradient descent (NAG) for the strongly convex function has a faster linear convergence rate than ISTA. Based on the tighter pivotal inequality, we also generalize the faster linear convergence rate to composite optimization, in both the objective value and the squared proximal subgradient norm, by taking advantage of the well-constructed Lyapunov function with a slight modification and the phase-space representation based on the high-resolution differential equation framework from the implicit-velocity scheme.  ( 2 min )
    Recurrent Convolutional Neural Networks Learn Succinct Learning Algorithms. (arXiv:2209.00735v2 [cs.LG] UPDATED)
    Neural networks (NNs) struggle to efficiently solve certain problems, such as learning parities, even when there are simple learning algorithms for those problems. Can NNs discover learning algorithms on their own? We exhibit a NN architecture that, in polynomial time, learns as well as any efficient learning algorithm describable by a constant-sized program. For example, on parity problems, the NN learns as well as Gaussian elimination, an efficient algorithm that can be succinctly described. Our architecture combines both recurrent weight sharing between layers and convolutional weight sharing to reduce the number of parameters down to a constant, even though the network itself may have trillions of nodes. While in practice the constants in our analysis are too large to be directly meaningful, our work suggests that the synergy of Recurrent and Convolutional NNs (RCNNs) may be more natural and powerful than either alone, particularly for concisely parameterizing discrete algorithms.  ( 2 min )
    Failure-informed adaptive sampling for PINNs. (arXiv:2210.00279v3 [math.NA] UPDATED)
    Physics-informed neural networks (PINNs) have emerged as an effective technique for solving PDEs in a wide range of domains. It is noticed, however, the performance of PINNs can vary dramatically with different sampling procedures. For instance, a fixed set of (prior chosen) training points may fail to capture the effective solution region (especially for problems with singularities). To overcome this issue, we present in this work an adaptive strategy, termed the failure-informed PINNs (FI-PINNs), which is inspired by the viewpoint of reliability analysis. The key idea is to define an effective failure probability based on the residual, and then, with the aim of placing more samples in the failure region, the FI-PINNs employs a failure-informed enrichment technique to adaptively add new collocation points to the training set, such that the numerical accuracy is dramatically improved. In short, similar as adaptive finite element methods, the proposed FI-PINNs adopts the failure probability as the posterior error indicator to generate new training points. We prove rigorous error bounds of FI-PINNs and illustrate its performance through several problems.  ( 2 min )
    Nonlinear Independent Component Analysis for Discrete-Time and Continuous-Time Signals. (arXiv:2102.02876v3 [stat.ML] UPDATED)
    We study the classical problem of recovering a multidimensional source signal from observations of nonlinear mixtures of this signal. We show that this recovery is possible (up to a permutation and monotone scaling of the source's original component signals) if the mixture is due to a sufficiently differentiable and invertible but otherwise arbitrarily nonlinear function and the component signals of the source are statistically independent with 'non-degenerate' second-order statistics. The latter assumption requires the source signal to meet one of three regularity conditions which essentially ensure that the source is sufficiently far away from the non-recoverable extremes of being deterministic or constant in time. These assumptions, which cover many popular time series models and stochastic processes, allow us to reformulate the initial problem of nonlinear blind source separation as a simple-to-state problem of optimisation-based function approximation. We propose to solve this approximation problem by minimizing a novel type of objective function that efficiently quantifies the mutual statistical dependence between multiple stochastic processes via cumulant-like statistics. This yields a scalable and direct new method for nonlinear Independent Component Analysis with widely applicable theoretical guarantees and for which our experiments indicate good performance.  ( 2 min )
    Chebyshev-Cantelli PAC-Bayes-Bennett Inequality for the Weighted Majority Vote. (arXiv:2106.13624v2 [cs.LG] UPDATED)
    We present a new second-order oracle bound for the expected risk of a weighted majority vote. The bound is based on a novel parametric form of the Chebyshev- Cantelli inequality (a.k.a. one-sided Chebyshev's), which is amenable to efficient minimization. The new form resolves the optimization challenge faced by prior oracle bounds based on the Chebyshev-Cantelli inequality, the C-bounds [Germain et al., 2015], and, at the same time, it improves on the oracle bound based on second order Markov's inequality introduced by Masegosa et al. [2020]. We also derive a new concentration of measure inequality, which we name PAC-Bayes-Bennett, since it combines PAC-Bayesian bounding with Bennett's inequality. We use it for empirical estimation of the oracle bound. The PAC-Bayes-Bennett inequality improves on the PAC-Bayes-Bernstein inequality of Seldin et al. [2012]. We provide an empirical evaluation demonstrating that the new bounds can improve on the work of Masegosa et al. [2020]. Both the parametric form of the Chebyshev-Cantelli inequality and the PAC-Bayes-Bennett inequality may be of independent interest for the study of concentration of measure in other domains.  ( 2 min )
    Universal Prediction Band via Semi-Definite Programming. (arXiv:2103.17203v3 [stat.ML] UPDATED)
    We propose a computationally efficient method to construct nonparametric, heteroscedastic prediction bands for uncertainty quantification, with or without any user-specified predictive model. Our approach provides an alternative to the now-standard conformal prediction for uncertainty quantification, with novel theoretical insights and computational advantages. The data-adaptive prediction band is universally applicable with minimal distributional assumptions, has strong non-asymptotic coverage properties, and is easy to implement using standard convex programs. Our approach can be viewed as a novel variance interpolation with confidence and further leverages techniques from semi-definite programming and sum-of-squares optimization. Theoretical and numerical performances for the proposed approach for uncertainty quantification are analyzed.  ( 2 min )
    Dynamically Mitigating Data Discrepancy with Balanced Focal Loss for Replay Attack Detection. (arXiv:2006.14563v2 [cs.CV] UPDATED)
    It becomes urgent to design effective anti-spoofing algorithms for vulnerable automatic speaker verification systems due to the advancement of high-quality playback devices. Current studies mainly treat anti-spoofing as a binary classification problem between bonafide and spoofed utterances, while lack of indistinguishable samples makes it difficult to train a robust spoofing detector. In this paper, we argue that for anti-spoofing, it needs more attention for indistinguishable samples over easily-classified ones in the modeling process, to make correct discrimination a top priority. Therefore, to mitigate the data discrepancy between training and inference, we propose D3M, to leverage a balanced focal loss function as the training objective to dynamically scale the loss based on the traits of the sample itself. Besides, in the experiments, we select three kinds of features that contain both magnitude-based and phase-based information to form complementary and informative features. Experimental results on the ASVspoof2019 dataset demonstrate the superiority of the proposed methods by comparison between our systems and top-performing ones. Systems trained with the balanced focal loss perform significantly better than conventional cross-entropy loss. With complementary features, our fusion system with only three kinds of features outperforms other systems containing five or more complex single models by 22.5% for min-tDCF and 7% for EER, achieving a min-tDCF and an EER of 0.0124 and 0.55% respectively. Furthermore, we present and discuss the evaluation results on real replay data apart from the simulated ASVspoof2019 data, indicating that research for anti-spoofing still has a long way to go. Source code, analysis data, and other details are publicly available at $\href{https://github.com/asvspoof/D3M}{\text{https://github.com/asvspoof/D3M}}$.  ( 3 min )
    Elastic Similarity and Distance Measures for Multivariate Time Series. (arXiv:2102.10231v2 [cs.LG] UPDATED)
    This paper contributes multivariate versions of seven commonly used elastic similarity and distance measures for time series data analytics. Elastic similarity and distance measures are a class of similarity measures that can compensate for misalignments in the time axis of time series data. We adapt two existing strategies used in a multivariate version of the well-known Dynamic Time Warping (DTW), namely, Independent and Dependent DTW, to these seven measures. While these measures can be applied to various time series analysis tasks, we demonstrate their utility on multivariate time series classification using the nearest neighbor classifier. On 23 well-known datasets, we demonstrate that each of the measures but one achieves the highest accuracy relative to others on at least one dataset, supporting the value of developing a suite of multivariate similarity and distance measures. We also demonstrate that there are datasets for which either the dependent versions of all measures are more accurate than their independent counterparts or vice versa. In addition, we also construct a nearest neighbor-based ensemble of the measures and show that it is competitive to other state-of-the-art single-strategy multivariate time series classifiers.  ( 2 min )
    Learning Probabilistic Models from Generator Latent Spaces with Hat EBM. (arXiv:2210.16486v2 [cs.CV] UPDATED)
    This work proposes a method for using any generator network as the foundation of an Energy-Based Model (EBM). Our formulation posits that observed images are the sum of unobserved latent variables passed through the generator network and a residual random variable that spans the gap between the generator output and the image manifold. One can then define an EBM that includes the generator as part of its forward pass, which we call the Hat EBM. The model can be trained without inferring the latent variables of the observed data or calculating the generator Jacobian determinant. This enables explicit probabilistic modeling of the output distribution of any type of generator network. Experiments show strong performance of the proposed method on (1) unconditional ImageNet synthesis at 128x128 resolution, (2) refining the output of existing generators, and (3) learning EBMs that incorporate non-probabilistic generators. Code and pretrained models to reproduce our results are available at https://github.com/point0bar1/hat-ebm.  ( 2 min )
    Improved Algorithms for Neural Active Learning. (arXiv:2210.00423v3 [cs.LG] UPDATED)
    We improve the theoretical and empirical performance of neural-network(NN)-based active learning algorithms for the non-parametric streaming setting. In particular, we introduce two regret metrics by minimizing the population loss that are more suitable in active learning than the one used in state-of-the-art (SOTA) related work. Then, the proposed algorithm leverages the powerful representation of NNs for both exploitation and exploration, has the query decision-maker tailored for $k$-class classification problems with the performance guarantee, utilizes the full feedback, and updates parameters in a more practical and efficient manner. These careful designs lead to an instance-dependent regret upper bound, roughly improving by a multiplicative factor $O(\log T)$ and removing the curse of input dimensionality. Furthermore, we show that the algorithm can achieve the same performance as the Bayes-optimal classifier in the long run under the hard-margin setting in classification problems. In the end, we use extensive experiments to evaluate the proposed algorithm and SOTA baselines, to show the improved empirical performance.  ( 2 min )
    Generic Error Bounds for the Generalized Lasso with Sub-Exponential Data. (arXiv:2004.05361v3 [math.ST] UPDATED)
    This work performs a non-asymptotic analysis of the generalized Lasso under the assumption of sub-exponential data. Our main results continue recent research on the benchmark case of (sub-)Gaussian sample distributions and thereby explore what conclusions are still valid when going beyond. While many statistical features remain unaffected (e.g., consistency and error decay rates), the key difference becomes manifested in how the complexity of the hypothesis set is measured. It turns out that the estimation error can be controlled by means of two complexity parameters that arise naturally from a generic-chaining-based proof strategy. The output model can be non-realizable, while the only requirement for the input vector is a generic concentration inequality of Bernstein-type, which can be implemented for a variety of sub-exponential distributions. This abstract approach allows us to reproduce, unify, and extend previously known guarantees for the generalized Lasso. In particular, we present applications to semi-parametric output models and phase retrieval via the lifted Lasso. Moreover, our findings are discussed in the context of sparse recovery and high-dimensional estimation problems.  ( 2 min )
    Black-box Coreset Variational Inference. (arXiv:2211.02377v2 [stat.ML] UPDATED)
    Recent advances in coreset methods have shown that a selection of representative datapoints can replace massive volumes of data for Bayesian inference, preserving the relevant statistical information and significantly accelerating subsequent downstream tasks. Existing variational coreset constructions rely on either selecting subsets of the observed datapoints, or jointly performing approximate inference and optimizing pseudodata in the observed space akin to inducing points methods in Gaussian Processes. So far, both approaches are limited by complexities in evaluating their objectives for general purpose models, and require generating samples from a typically intractable posterior over the coreset throughout inference and testing. In this work, we present a black-box variational inference framework for coresets that overcomes these constraints and enables principled application of variational coresets to intractable models, such as Bayesian neural networks. We apply our techniques to supervised learning problems, and compare them with existing approaches in the literature for data summarization and inference.  ( 2 min )
    On the Exactness of Dantzig-Wolfe Relaxation for Rank Constrained Optimization Problems. (arXiv:2210.16191v2 [math.OC] UPDATED)
    In the rank-constrained optimization problem (RCOP), it minimizes a linear objective function over a prespecified closed rank-constrained domain set and $m$ generic two-sided linear matrix inequalities. Motivated by the Dantzig-Wolfe (DW) decomposition, a popular approach of solving many nonconvex optimization problems, we investigate the strength of DW relaxation (DWR) of the RCOP, which admits the same formulation as RCOP except replacing the domain set by its closed convex hull. Notably, our goal is to characterize conditions under which the DWR matches RCOP for any m two-sided linear matrix inequalities. From the primal perspective, we develop the first-known simultaneously necessary and sufficient conditions that achieve: (i) extreme point exactness -- all the extreme points of the DWR feasible set belong to that of the RCOP; (ii) convex hull exactness -- the DWR feasible set is identical to the closed convex hull of RCOP feasible set; and (iii) objective exactness -- the optimal values of the DWR and RCOP coincide. The proposed conditions unify, refine, and extend the existing exactness results in the quadratically constrained quadratic program (QCQP) and fair unsupervised learning. These conditions can be very useful to identify new results, including the extreme point exactness for a QCQP problem that admits an inhomogeneous objective function with two homogeneous two-sided quadratic constraints and the convex hull exactness for fair SVD.  ( 2 min )
    Outlier Robust and Sparse Estimation of Linear Regression Coefficients. (arXiv:2208.11592v2 [math.ST] UPDATED)
    We consider outlier-robust and sparse estimation of linear regression coefficients, when covariate vectors and noises are sampled, respectively, from an $\mathfrak{L}$-subGaussian distribution and a heavy-tailed distribution. Additionally, the covariate vectors and noises are contaminated by adversarial outliers. We deal with two cases: the covariance matrix of the covariates is known or unknown. Particularly, in the known case, our estimator can attain a nearly information theoretical optimal error bound, and our error bound is sharper than those of earlier studies dealing with similar situations. Our estimator analysis relies heavily on generic chaining to derive sharp error bounds.  ( 2 min )
    Training Scale-Invariant Neural Networks on the Sphere Can Happen in Three Regimes. (arXiv:2209.03695v3 [cs.LG] UPDATED)
    A fundamental property of deep learning normalization techniques, such as batch normalization, is making the pre-normalization parameters scale invariant. The intrinsic domain of such parameters is the unit sphere, and therefore their gradient optimization dynamics can be represented via spherical optimization with varying effective learning rate (ELR), which was studied previously. However, the varying ELR may obscure certain characteristics of the intrinsic loss landscape structure. In this work, we investigate the properties of training scale-invariant neural networks directly on the sphere using a fixed ELR. We discover three regimes of such training depending on the ELR value: convergence, chaotic equilibrium, and divergence. We study these regimes in detail both on a theoretical examination of a toy example and on a thorough empirical analysis of real scale-invariant deep learning models. Each regime has unique features and reflects specific properties of the intrinsic loss landscape, some of which have strong parallels with previous research on both regular and scale-invariant neural networks training. Finally, we demonstrate how the discovered regimes are reflected in conventional training of normalized networks and how they can be leveraged to achieve better optima.  ( 2 min )
    Relay Variational Inference: A Method for Accelerated Encoderless VI. (arXiv:2110.13422v2 [cs.LG] UPDATED)
    Variational Inference (VI) offers a method for approximating intractable likelihoods. In neural VI, inference of approximate posteriors is commonly done using an encoder. Alternatively, encoderless VI offers a framework for learning generative models from data without encountering suboptimalities caused by amortization via an encoder (e.g. in presence of missing or uncertain data). However, in absence of an encoder, such methods often suffer in convergence due to the slow nature of gradient steps required to learn the approximate posterior parameters. In this paper, we introduce Relay VI (RVI), a framework that dramatically improves both the convergence and performance of encoderless VI. In our experiments over multiple datasets, we study the effectiveness of RVI in terms of convergence speed, loss, representation power and missing data imputation. We find RVI to be a unique tool, often superior in both performance and convergence speed to previously proposed encoderless as well as amortized VI models (e.g. VAE).  ( 2 min )
    Spectrum of non-Hermitian deep-Hebbian neural networks. (arXiv:2208.11411v2 [q-bio.NC] UPDATED)
    Neural networks with recurrent asymmetric couplings are important to understand how episodic memories are encoded in the brain. Here, we integrate the experimental observation of wide synaptic integration window into our model of sequence retrieval in the continuous time dynamics. The model with non-normal neuron-interactions is theoretically studied by deriving a random matrix theory of the Jacobian matrix in neural dynamics. The spectra bears several distinct features, such as breaking rotational symmetry about the origin, and the emergence of nested voids within the spectrum boundary. The spectral density is thus highly non-uniformly distributed in the complex plane. The random matrix theory also predicts a transition to chaos. In particular, the edge of chaos provides computational benefits for the sequential retrieval of memories. Our work provides a systematic study of time-lagged correlations with arbitrary time delays, and thus can inspire future studies of a broad class of memory models, and even big data analysis of biological time series.  ( 2 min )
    Random Fully Connected Neural Networks as Perturbatively Solvable Hierarchies. (arXiv:2204.01058v2 [math.PR] UPDATED)
    This article considers fully connected neural networks with Gaussian random weights and biases as well as $L$ hidden layers, each of width proportional to a large parameter $n$. For polynomially bounded non-linearities we give sharp estimates in powers of $1/n$ for the joint cumulants of the network output and its derivatives. Moreover, we show that network cumulants form a perturbatively solvable hierarchy in powers of $1/n$ in that $k$-th order cumulants in one layer have recursions that depend to leading order in $1/n$ only on $j$-th order cumulants at the previous layer with $j\leq k$. By solving a variety of such recursions, however, we find that the depth-to-width ratio $L/n$ plays the role of an effective network depth, controlling both the scale of fluctuations at individual neurons and the size of inter-neuron correlations. Thus, while the cumulant recursions we derive form a hierarchy in powers of $1/n$, contributions of order $1/n^k$ often grow like $L^k$ and are hence non-negligible at positive $L/n$. We use this to study a somewhat simplified version of the exploding and vanishing gradient problem, proving that this particular variant occurs if and only if $L/n$ is large. Several key ideas in this article were first developed at a physics level of rigor in a recent monograph of Daniel A. Roberts, Sho Yaida, and the author. This article not only makes these ideas mathematically precise but also significantly extends them, opening the way to obtaining corrections to all orders in $1/n$.  ( 2 min )
    Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit. (arXiv:2207.08799v3 [cs.LG] UPDATED)
    There is mounting evidence of emergent phenomena in the capabilities of deep learning methods as we scale up datasets, model sizes, and training times. While there are some accounts of how these resources modulate statistical capacity, far less is known about their effect on the computational problem of model training. This work conducts such an exploration through the lens of learning a $k$-sparse parity of $n$ bits, a canonical discrete search problem which is statistically easy but computationally hard. Empirically, we find that a variety of neural networks successfully learn sparse parities, with discontinuous phase transitions in the training curves. On small instances, learning abruptly occurs at approximately $n^{O(k)}$ iterations; this nearly matches SQ lower bounds, despite the apparent lack of a sparse prior. Our theoretical analysis shows that these observations are not explained by a Langevin-like mechanism, whereby SGD "stumbles in the dark" until it finds the hidden set of features (a natural algorithm which also runs in $n^{O(k)}$ time). Instead, we show that SGD gradually amplifies the sparse solution via a Fourier gap in the population gradient, making continual progress that is invisible to loss and error metrics.  ( 2 min )
    Evaluating Robustness to Dataset Shift via Parametric Robustness Sets. (arXiv:2205.15947v4 [cs.LG] UPDATED)
    We give a method for proactively identifying small, plausible shifts in distribution which lead to large differences in model performance. These shifts are defined via parametric changes in the causal mechanisms of observed variables, where constraints on parameters yield a "robustness set" of plausible distributions and a corresponding worst-case loss over the set. While the loss under an individual parametric shift can be estimated via reweighting techniques such as importance sampling, the resulting worst-case optimization problem is non-convex, and the estimate may suffer from large variance. For small shifts, however, we can construct a local second-order approximation to the loss under shift and cast the problem of finding a worst-case shift as a particular non-convex quadratic optimization problem, for which efficient algorithms are available. We demonstrate that this second-order approximation can be estimated directly for shifts in conditional exponential family models, and we bound the approximation error. We apply our approach to a computer vision task (classifying gender from images), revealing sensitivity to shifts in non-causal attributes.  ( 2 min )
    When saliency goes off on a tangent: Interpreting Deep Neural Networks with nonlinear saliency maps. (arXiv:2110.06639v3 [cs.LG] UPDATED)
    A fundamental bottleneck in utilising complex machine learning systems for critical applications has been not knowing why they do and what they do, thus preventing the development of any crucial safety protocols. To date, no method exist that can provide full insight into the granularity of the neural network's decision process. In the past, saliency maps were an early attempt at resolving this problem through sensitivity calculations, whereby dimensions of a data point are selected based on how sensitive the output of the system is to them. However, the success of saliency maps has been at best limited, mainly due to the fact that they interpret the underlying learning system through a linear approximation. We present a novel class of methods for generating nonlinear saliency maps which fully account for the nonlinearity of the underlying learning system. While agreeing with linear saliency maps on simple problems where linear saliency maps are correct, they clearly identify more specific drivers of classification on complex examples where nonlinearities are more pronounced. This new class of methods significantly aids interpretability of deep neural networks and related machine learning systems. Crucially, they provide a starting point for their more broad use in serious applications, where 'why' is equally important as 'what'.  ( 2 min )
    AutoML Two-Sample Test. (arXiv:2206.08843v3 [cs.LG] UPDATED)
    Two-sample tests are important in statistics and machine learning, both as tools for scientific discovery as well as to detect distribution shifts. This led to the development of many sophisticated test procedures going beyond the standard supervised learning frameworks, whose usage can require specialized knowledge about two-sample testing. We use a simple test that takes the mean discrepancy of a witness function as the test statistic and prove that minimizing a squared loss leads to a witness with optimal testing power. This allows us to leverage recent advancements in AutoML. Without any user input about the problems at hand, and using the same method for all our experiments, our AutoML two-sample test achieves competitive performance on a diverse distribution shift benchmark as well as on challenging two-sample testing problems. We provide an implementation of the AutoML two-sample test in the Python package autotst.  ( 2 min )
    The Missing Invariance Principle Found -- the Reciprocal Twin of Invariant Risk Minimization. (arXiv:2205.14546v2 [cs.LG] UPDATED)
    Machine learning models often generalize poorly to out-of-distribution (OOD) data as a result of relying on features that are spuriously correlated with the label during training. Recently, the technique of Invariant Risk Minimization (IRM) was proposed to learn predictors that only use invariant features by conserving the feature-conditioned label expectation $\mathbb{E}_e[y|f(x)]$ across environments. However, more recent studies have demonstrated that IRM-v1, a practical version of IRM, can fail in various settings. Here, we identify a fundamental flaw of IRM formulation that causes the failure. We then introduce a complementary notion of invariance, MRI, based on conserving the label-conditioned feature expectation $\mathbb{E}_e[f(x)|y]$, which is free of this flaw. Further, we introduce a simplified, practical version of the MRI formulation called MRI-v1. We prove that for general linear problems, MRI-v1 guarantees invariant predictors given sufficient number of environments. We also empirically demonstrate that MRI-v1 strongly out-performs IRM-v1 and consistently achieves near-optimal OOD generalization in image-based nonlinear problems.  ( 2 min )
    RenyiCL: Contrastive Representation Learning with Skew Renyi Divergence. (arXiv:2208.06270v2 [stat.ML] UPDATED)
    Contrastive representation learning seeks to acquire useful representations by estimating the shared information between multiple views of data. Here, the choice of data augmentation is sensitive to the quality of learned representations: as harder the data augmentations are applied, the views share more task-relevant information, but also task-irrelevant one that can hinder the generalization capability of representation. Motivated by this, we present a new robust contrastive learning scheme, coined R\'enyiCL, which can effectively manage harder augmentations by utilizing R\'enyi divergence. Our method is built upon the variational lower bound of R\'enyi divergence, but a na\"ive usage of a variational method is impractical due to the large variance. To tackle this challenge, we propose a novel contrastive objective that conducts variational estimation of a skew R\'enyi divergence and provide a theoretical guarantee on how variational estimation of skew divergence leads to stable training. We show that R\'enyi contrastive learning objectives perform innate hard negative sampling and easy positive sampling simultaneously so that it can selectively learn useful features and ignore nuisance features. Through experiments on ImageNet, we show that R\'enyi contrastive learning with stronger augmentations outperforms other self-supervised methods without extra regularization or computational overhead. Moreover, we also validate our method on other domains such as graph and tabular, showing empirical gain over other contrastive methods.  ( 2 min )
    The Mechanism of Prediction Head in Non-contrastive Self-supervised Learning. (arXiv:2205.06226v3 [cs.LG] UPDATED)
    Recently the surprising discovery of the Bootstrap Your Own Latent (BYOL) method by Grill et al. shows the negative term in contrastive loss can be removed if we add the so-called prediction head to the network. This initiated the research of non-contrastive self-supervised learning. It is mysterious why even when there exist trivial collapsed global optimal solutions, neural networks trained by (stochastic) gradient descent can still learn competitive representations. This phenomenon is a typical example of implicit bias in deep learning and remains little understood. In this work, we present our empirical and theoretical discoveries on non-contrastive self-supervised learning. Empirically, we find that when the prediction head is initialized as an identity matrix with only its off-diagonal entries being trainable, the network can learn competitive representations even though the trivial optima still exist in the training objective. Theoretically, we present a framework to understand the behavior of the trainable, but identity-initialized prediction head. Under a simple setting, we characterized the substitution effect and acceleration effect of the prediction head. The substitution effect happens when learning the stronger features in some neurons can substitute for learning these features in other neurons through updating the prediction head. And the acceleration effect happens when the substituted features can accelerate the learning of other weaker features to prevent them from being ignored. These two effects enable the neural networks to learn all the features rather than focus only on learning the stronger features, which is likely the cause of the dimensional collapse phenomenon. To the best of our knowledge, this is also the first end-to-end optimization guarantee for non-contrastive methods using nonlinear neural networks with a trainable prediction head and normalization.  ( 3 min )
    coVariance Neural Networks. (arXiv:2205.15856v4 [cs.LG] UPDATED)
    Graph neural networks (GNN) are an effective framework that exploit inter-relationships within graph-structured data for learning. Principal component analysis (PCA) involves the projection of data on the eigenspace of the covariance matrix and draws similarities with the graph convolutional filters in GNNs. Motivated by this observation, we study a GNN architecture, called coVariance neural network (VNN), that operates on sample covariance matrices as graphs. We theoretically establish the stability of VNNs to perturbations in the covariance matrix, thus, implying an advantage over standard PCA-based data analysis approaches that are prone to instability due to principal components associated with close eigenvalues. Our experiments on real-world datasets validate our theoretical results and show that VNN performance is indeed more stable than PCA-based statistical approaches. Moreover, our experiments on multi-resolution datasets also demonstrate that VNNs are amenable to transferability of performance over covariance matrices of different dimensions; a feature that is infeasible for PCA-based approaches.  ( 2 min )
    Analysis of autocorrelation times in Neural Markov Chain Monte Carlo simulations. (arXiv:2111.10189v3 [cond-mat.stat-mech] UPDATED)
    We provide a deepened study of autocorrelations in Neural Markov Chain Monte Carlo (NMCMC) simulations, a version of the traditional Metropolis algorithm which employs neural networks to provide independent proposals. We illustrate our ideas using the two-dimensional Ising model. We discuss several estimates of autocorrelation times in the context of NMCMC, some inspired by analytical results derived for the Metropolized Independent Sampler (MIS). We check their reliability by estimating them on a small system where analytical results can also be obtained. Based on the analytical results for MIS we propose a new loss function and study its impact on the autocorelation times. Although, this function's performance is a bit inferior to the traditional Kullback-Leibler divergence, it offers two training algorithms which in some situations may be beneficial. By studying a small, $4 \times 4$, system we gain access to the dynamics of the training process which we visualize using several observables. Furthermore, we quantitatively investigate the impact of imposing global discrete symmetries of the system in the neural network training process on the autocorrelation times. Eventually, we propose a scheme which incorporates partial heat-bath updates which considerably improves the quality of the training. The impact of the above enhancements is discussed for a $16 \times 16$ spin system. The summary of our findings may serve as a guidance to the implementation of Neural Markov Chain Monte Carlo simulations for more complicated models.  ( 2 min )
    Split-kl and PAC-Bayes-split-kl Inequalities for Ternary Random Variables. (arXiv:2206.00706v2 [stat.ML] UPDATED)
    We present a new concentration of measure inequality for sums of independent bounded random variables, which we name a split-kl inequality. The inequality is particularly well-suited for ternary random variables, which naturally show up in a variety of problems, including analysis of excess losses in classification, analysis of weighted majority votes, and learning with abstention. We demonstrate that for ternary random variables the inequality is simultaneously competitive with the kl inequality, the Empirical Bernstein inequality, and the Unexpected Bernstein inequality, and in certain regimes outperforms all of them. It resolves an open question by Tolstikhin and Seldin [2013] and Mhammedi et al. [2019] on how to match simultaneously the combinatorial power of the kl inequality when the distribution happens to be close to binary and the power of Bernstein inequalities to exploit low variance when the probability mass is concentrated on the middle value. We also derive a PAC-Bayes-split-kl inequality and compare it with the PAC-Bayes-kl, PAC-Bayes-Empirical-Bennett, and PAC-Bayes-Unexpected-Bernstein inequalities in an analysis of excess losses and in an analysis of a weighted majority vote for several UCI datasets. Last but not least, our study provides the first direct comparison of the Empirical Bernstein and Unexpected Bernstein inequalities and their PAC-Bayes extensions.  ( 2 min )
    Geometry-Complete Perceptron Networks for 3D Molecular Graphs. (arXiv:2211.02504v2 [cs.LG] UPDATED)
    The field of geometric deep learning has had a profound impact on the development of innovative and powerful graph neural network architectures. Disciplines such as computer vision and computational biology have benefited significantly from such methodological advances, which has led to breakthroughs in scientific domains such as protein structure prediction and design. In this work, we introduce GCPNet, a new geometry-complete, SE(3)-equivariant graph neural network designed for 3D molecular graph representation learning. We demonstrate the state-of-the-art utility and expressiveness of our method on six independent datasets designed for three distinct geometric tasks: protein-ligand binding affinity prediction, protein structure ranking, and Newtonian many-body systems modeling. Our results suggest that GCPNet is a powerful, general method for capturing complex geometric and physical interactions within 3D molecular graphs for downstream prediction tasks. The source code, data, and instructions to train new models or reproduce our results are freely available at https://github.com/BioinfoMachineLearning/GCPNet.  ( 2 min )
    Recipes for when Physics Fails: Recovering Robust Learning of Physics Informed Neural Networks. (arXiv:2110.13330v2 [cs.LG] UPDATED)
    Physics-informed Neural Networks (PINNs) have been shown to be effective in solving partial differential equations by capturing the physics induced constraints as a part of the training loss function. This paper shows that a PINN can be sensitive to errors in training data and overfit itself in dynamically propagating these errors over the domain of the solution of the PDE. It also shows how physical regularizations based on continuity criteria and conservation laws fail to address this issue and rather introduce problems of their own causing the deep network to converge to a physics-obeying local minimum instead of the global minimum. We introduce Gaussian Process (GP) based smoothing that recovers the performance of a PINN and promises a robust architecture against noise/errors in measurements. Additionally, we illustrate an inexpensive method of quantifying the evolution of uncertainty based on the variance estimation of GPs on boundary data. Robust PINN performance is also shown to be achievable by choice of sparse sets of inducing points based on sparsely induced GPs. We demonstrate the performance of our proposed methods and compare the results from existing benchmark models in literature for time-dependent Schr\"odinger and Burgers' equations.  ( 2 min )
    Sinkhorn Divergences for Unbalanced Optimal Transport. (arXiv:1910.12958v3 [math.OC] UPDATED)
    Optimal transport induces the Earth Mover's (Wasserstein) distance between probability distributions, a geometric divergence that is relevant to a wide range of problems. Over the last decade, two relaxations of optimal transport have been studied in depth: unbalanced transport, which is robust to the presence of outliers and can be used when distributions don't have the same total mass; entropy-regularized transport, which is robust to sampling noise and lends itself to fast computations using the Sinkhorn algorithm. This paper combines both lines of work to put robust optimal transport on solid ground. Our main contribution is a generalization of the Sinkhorn algorithm to unbalanced transport: our method alternates between the standard Sinkhorn updates and the pointwise application of a contractive function. This implies that entropic transport solvers on grid images, point clouds and sampled distributions can all be modified easily to support unbalanced transport, with a proof of linear convergence that holds in all settings. We then show how to use this method to define pseudo-distances on the full space of positive measures that satisfy key geometric axioms: (unbalanced) Sinkhorn divergences are differentiable, positive, definite, convex, statistically robust and avoid any "entropic bias" towards a shrinkage of the measures' supports.  ( 2 min )
    Generalization Error Bounds for Multiclass Sparse Linear Classifiers. (arXiv:2204.06264v2 [math.ST] UPDATED)
    We consider high-dimensional multiclass classification by sparse multinomial logistic regression. Unlike binary classification, in the multiclass setup one can think about an entire spectrum of possible notions of sparsity associated with different structural assumptions on the regression coefficients matrix. We propose a computationally feasible feature selection procedure based on penalized maximum likelihood with convex penalties capturing a specific type of sparsity at hand. In particular, we consider global sparsity, double row-wise sparsity, and low-rank sparsity, and show that with the properly chosen tuning parameters the derived plug-in classifiers attain the minimax generalization error bounds (in terms of misclassification excess risk) within the corresponding classes of multiclass sparse linear classifiers. The developed approach is general and can be adapted to other types of sparsity as well.  ( 2 min )
    Post-training Quantization for Neural Networks with Provable Guarantees. (arXiv:2201.11113v3 [cs.LG] UPDATED)
    While neural networks have been remarkably successful in a wide array of applications, implementing them in resource-constrained hardware remains an area of intense research. By replacing the weights of a neural network with quantized (e.g., 4-bit, or binary) counterparts, massive savings in computation cost, memory, and power consumption are attained. To that end, we generalize a post-training neural-network quantization method, GPFQ, that is based on a greedy path-following mechanism. Among other things, we propose modifications to promote sparsity of the weights, and rigorously analyze the associated error. Additionally, our error analysis expands the results of previous work on GPFQ to handle general quantization alphabets, showing that for quantizing a single-layer network, the relative square error essentially decays linearly in the number of weights -- i.e., level of over-parametrization. Our result holds across a range of input distributions and for both fully-connected and convolutional architectures thereby also extending previous results. To empirically evaluate the method, we quantize several common architectures with few bits per weight, and test them on ImageNet, showing only minor loss of accuracy compared to unquantized models. We also demonstrate that standard modifications, such as bias correction and mixed precision quantization, further improve accuracy.  ( 2 min )
    Adapting to Online Label Shift with Provable Guarantees. (arXiv:2207.02121v3 [cs.LG] UPDATED)
    The standard supervised learning paradigm works effectively when training data shares the same distribution as the upcoming testing samples. However, this stationary assumption is often violated in real-world applications, especially when testing data appear in an online fashion. In this paper, we formulate and investigate the problem of \emph{online label shift} (OLaS): the learner trains an initial model from the labeled offline data and then deploys it to an unlabeled online environment where the underlying label distribution changes over time but the label-conditional density does not. The non-stationarity nature and the lack of supervision make the problem challenging to be tackled. To address the difficulty, we construct a new unbiased risk estimator that utilizes the unlabeled data, which exhibits many benign properties albeit with potential non-convexity. Building upon that, we propose novel online ensemble algorithms to deal with the non-stationarity of the environments. Our approach enjoys optimal \emph{dynamic regret}, indicating that the performance is competitive with a clairvoyant who knows the online environments in hindsight and then chooses the best decision for each round. The obtained dynamic regret bound scales with the intensity and pattern of label distribution shift, hence exhibiting the adaptivity in the OLaS problem. Extensive experiments are conducted to validate the effectiveness and support our theoretical findings.  ( 2 min )
    Minimax Optimal Online Imitation Learning via Replay Estimation. (arXiv:2205.15397v5 [cs.LG] UPDATED)
    Online imitation learning is the problem of how best to mimic expert demonstrations, given access to the environment or an accurate simulator. Prior work has shown that in the infinite sample regime, exact moment matching achieves value equivalence to the expert policy. However, in the finite sample regime, even if one has no optimization error, empirical variance can lead to a performance gap that scales with $H^2 / N$ for behavioral cloning and $H / \sqrt{N}$ for online moment matching, where $H$ is the horizon and $N$ is the size of the expert dataset. We introduce the technique of replay estimation to reduce this empirical variance: by repeatedly executing cached expert actions in a stochastic simulator, we compute a smoother expert visitation distribution estimate to match. In the presence of general function approximation, we prove a meta theorem reducing the performance gap of our approach to the parameter estimation error for offline classification (i.e. learning the expert policy). In the tabular setting or with linear function approximation, our meta theorem shows that the performance gap incurred by our approach achieves the optimal $\widetilde{O} \left( \min({H^{3/2}} / {N}, {H} / {\sqrt{N}} \right)$ dependency, under significantly weaker assumptions compared to prior work. We implement multiple instantiations of our approach on several continuous control tasks and find that we are able to significantly improve policy performance across a variety of dataset sizes.  ( 2 min )
    Weisfeiler and Leman Go Walking: Random Walk Kernels Revisited. (arXiv:2205.10914v3 [cs.LG] UPDATED)
    Random walk kernels have been introduced in seminal work on graph learning and were later largely superseded by kernels based on the Weisfeiler-Leman test for graph isomorphism. We give a unified view on both classes of graph kernels. We study walk-based node refinement methods and formally relate them to several widely-used techniques, including Morgan's algorithm for molecule canonization and the Weisfeiler-Leman test. We define corresponding walk-based kernels on nodes that allow fine-grained parameterized neighborhood comparison, reach Weisfeiler-Leman expressiveness, and are computed using the kernel trick. From this we show that classical random walk kernels with only minor modifications regarding definition and computation are as expressive as the widely-used Weisfeiler-Leman subtree kernel but support non-strict neighborhood comparison. We verify experimentally that walk-based kernels reach or even surpass the accuracy of Weisfeiler-Leman kernels in real-world classification tasks.  ( 2 min )
    Toward Explainable AI for Regression Models. (arXiv:2112.11407v2 [cs.LG] UPDATED)
    In addition to the impressive predictive power of machine learning (ML) models, more recently, explanation methods have emerged that enable an interpretation of complex non-linear learning models such as deep neural networks. Gaining a better understanding is especially important e.g. for safety-critical ML applications or medical diagnostics etc. While such Explainable AI (XAI) techniques have reached significant popularity for classifiers, so far little attention has been devoted to XAI for regression models (XAIR). In this review, we clarify the fundamental conceptual differences of XAI for regression and classification tasks, establish novel theoretical insights and analysis for XAIR, provide demonstrations of XAIR on genuine practical regression problems, and finally discuss the challenges remaining for the field.  ( 2 min )
    Neural Network Architecture Beyond Width and Depth. (arXiv:2205.09459v4 [cs.LG] UPDATED)
    This paper proposes a new neural network architecture by introducing an additional dimension called height beyond width and depth. Neural network architectures with height, width, and depth as hyper-parameters are called three-dimensional architectures. It is shown that neural networks with three-dimensional architectures are significantly more expressive than the ones with two-dimensional architectures (those with only width and depth as hyper-parameters), e.g., standard fully connected networks. The new network architecture is constructed recursively via a nested structure, and hence we call a network with the new architecture nested network (NestNet). A NestNet of height $s$ is built with each hidden neuron activated by a NestNet of height $\le s-1$. When $s=1$, a NestNet degenerates to a standard network with a two-dimensional architecture. It is proved by construction that height-$s$ ReLU NestNets with $\mathcal{O}(n)$ parameters can approximate $1$-Lipschitz continuous functions on $[0,1]^d$ with an error $\mathcal{O}(n^{-(s+1)/d})$, while the optimal approximation error of standard ReLU networks with $\mathcal{O}(n)$ parameters is $\mathcal{O}(n^{-2/d})$. Furthermore, such a result is extended to generic continuous functions on $[0,1]^d$ with the approximation error characterized by the modulus of continuity. Finally, we use numerical experimentation to show the advantages of the super-approximation power of ReLU NestNets.  ( 2 min )
    Fast Bayesian Coresets via Subsampling and Quasi-Newton Refinement. (arXiv:2203.09675v3 [stat.ML] UPDATED)
    Bayesian coresets approximate a posterior distribution by building a small weighted subset of the data points. Any inference procedure that is too computationally expensive to be run on the full posterior can instead be run inexpensively on the coreset, with results that approximate those on the full data. However, current approaches are limited by either a significant run-time or the need for the user to specify a low-cost approximation to the full posterior. We propose a Bayesian coreset construction algorithm that first selects a uniformly random subset of data, and then optimizes the weights using a novel quasi-Newton method. Our algorithm is a simple to implement, black-box method, that does not require the user to specify a low-cost posterior approximation. It is the first to come with a general high-probability bound on the KL divergence of the output coreset posterior. Experiments demonstrate that our method provides significant improvements in coreset quality against alternatives with comparable construction times, with far less storage cost and user input required.  ( 2 min )
    Forecasting Market Changes using Variational Inference. (arXiv:2205.00605v2 [q-fin.ST] UPDATED)
    Though various approaches have been considered, forecasting near-term market changes of equities and similar market data remains quite difficult. In this paper we introduce an approach to forecast near-term market changes for equity indices as well as portfolios using variational inference (VI). VI is a machine learning approach which uses optimization techniques to estimate complex probability densities. In the proposed approach, clusters of explanatory variables are identified and market changes are forecast based on cluster-specific linear regression. Apart from the expected value of changes, the proposed approach can also be used to obtain the distribution of possible outcomes. Another advantage of the proposed approach is the clear model interpretation, as clusters of explanatory variables (or market regimes) are identified for which the future changes follow similar relationships. Knowledge about such clusters can provide useful insights about portfolio performance and identify the relative importance of variables in different market regimes. An illustrative example of predicting one-day S\&P change is considered and it is shown that even with as few as three explanatory variables, the proposed approach provides useful predictions.  ( 2 min )
    Random Planted Forest: a directly interpretable tree ensemble. (arXiv:2012.14563v2 [stat.ML] UPDATED)
    We introduce a novel interpretable, tree based algorithm for prediction in a regression setting in which each tree in a classical random forest is replaced by a family of planted trees that grow simultaneously. The motivation for our algorithm is to estimate the unknown regression function from a functional decomposition perspective, where each tree corresponds to a function within that decomposition. The maximal order of approximation in the decomposition can be specified or left unlimited. If a first order approximation is chosen, the result is an additive model. In the other extreme case, if the order of approximation is not limited, the resulting model places no restrictions on the form of the regression function. In a simulation study we find encouraging prediction and visualisation properties of our random planted forest method. We also develop theory for an idealised version of random planted forests in cases where the maximal order of approximation is low. We show that if the order is smaller than three, the idealised version achieves asymptotically optimal convergence rates up to a logarithmic factor. ode is available on https://github.com/PlantedML/randomPlantedForest  ( 2 min )
    The Unbalanced Gromov Wasserstein Distance: Conic Formulation and Relaxation. (arXiv:2009.04266v3 [math.OC] UPDATED)
    Comparing metric measure spaces (i.e. a metric space endowed with aprobability distribution) is at the heart of many machine learning problems. The most popular distance between such metric measure spaces is theGromov-Wasserstein (GW) distance, which is the solution of a quadratic assignment problem. The GW distance is however limited to the comparison of metric measure spaces endowed with a probability distribution. To alleviate this issue, we introduce two Unbalanced Gromov-Wasserstein formulations: a distance and a more tractable upper-bounding relaxation.They both allow the comparison of metric spaces equipped with arbitrary positive measures up to isometries. The first formulation is a positive and definite divergence based on a relaxation of the mass conservation constraint using a novel type of quadratically-homogeneous divergence. This divergence works hand in hand with the entropic regularization approach which is popular to solve large scale optimal transport problems. We show that the underlying non-convex optimization problem can be efficiently tackled using a highly parallelizable and GPU-friendly iterative scheme. The second formulation is a distance between mm-spaces up to isometries based on a conic lifting. Lastly, we provide numerical experiments onsynthetic examples and domain adaptation data with a Positive-Unlabeled learning task to highlight the salient features of the unbalanced divergence and its potential applications in ML.  ( 2 min )
    A Concentration of Measure Framework to study convex problems and other implicit formulation problems in machine learning. (arXiv:2010.09877v2 [math.PR] UPDATED)
    This paper provides a framework to show the concentration of solutions $Y^*$ to convex minimizing problem where the objective function $\phi(X)(Y)$ depends on some random vector $X$ satisfying concentration of measure hypotheses. More precisely, the convex problem translates into a contractive fixed point equation that ensure the transmission of the concentration from $X$ to $Y^*$. This result is of central interest to characterize many machine learning algorithms which are defined through implicit equations (e.g., logistic regression, lasso, boosting, etc.). Based on our framework, we provide precise estimations for the first moments of the solution $Y^*$, when $X= (x_1,\ldots, x_n)$ is a data matrix of independent columns and $\phi(X)(y)$ writes as a sum $\frac{1}{n}\sum_{i=1}^n h_i(x_i^TY)$. That allows to describe the behavior and performance (e.g., generalization error) of a wide variety of machine learning classifiers.  ( 2 min )
    Temporal-Logic-Based Reward Shaping for Continuing Reinforcement Learning Tasks. (arXiv:2007.01498v2 [cs.AI] UPDATED)
    In continuing tasks, average-reward reinforcement learning may be a more appropriate problem formulation than the more common discounted reward formulation. As usual, learning an optimal policy in this setting typically requires a large amount of training experiences. Reward shaping is a common approach for incorporating domain knowledge into reinforcement learning in order to speed up convergence to an optimal policy. However, to the best of our knowledge, the theoretical properties of reward shaping have thus far only been established in the discounted setting. This paper presents the first reward shaping framework for average-reward learning and proves that, under standard assumptions, the optimal policy under the original reward function can be recovered. In order to avoid the need for manual construction of the shaping function, we introduce a method for utilizing domain knowledge expressed as a temporal logic formula. The formula is automatically translated to a shaping function that provides additional reward throughout the learning process. We evaluate the proposed method on three continuing tasks. In all cases, shaping speeds up the average-reward learning rate without any reduction in the performance of the learned policy compared to relevant baselines.  ( 2 min )
    Necessary and Sufficient Conditions for Inverse Reinforcement Learning of Bayesian Stopping Time Problems. (arXiv:2007.03481v5 [cs.LG] UPDATED)
    This paper presents an inverse reinforcement learning~(IRL) framework for Bayesian stopping time problems. By observing the actions of a Bayesian decision maker, we provide a necessary and sufficient condition to identify if these actions are consistent with optimizing a cost function. In a Bayesian (partially observed) setting, the inverse learner can at best identify optimality wrt the observed actions. Our IRL algorithm identifies optimality and then constructs set valued estimates of the cost function. To achieve this IRL objective, we use novel ideas from Bayesian revealed preferences stemming from microeconomics. We illustrate the proposed IRL scheme using two important examples of stopping time problems, namely, sequential hypothesis testing and Bayesian search, and also on a real-world YouTube dataset. Finally, for finite datasets, we propose an IRL detection algorithm and give finite sample bounds on its error probabilities.  ( 2 min )
    Taming neural networks with TUSLA: Non-convex learning via adaptive stochastic gradient Langevin algorithms. (arXiv:2006.14514v4 [cs.LG] UPDATED)
    Artificial neural networks (ANNs) are typically highly nonlinear systems which are finely tuned via the optimization of their associated, non-convex loss functions. In many cases, the gradient of any such loss function has superlinear growth, making the use of the widely-accepted (stochastic) gradient descent methods, which are based on Euler numerical schemes, problematic. We offer a new learning algorithm based on an appropriately constructed variant of the popular stochastic gradient Langevin dynamics (SGLD), which is called tamed unadjusted stochastic Langevin algorithm (TUSLA). We also provide a nonasymptotic analysis of the new algorithm's convergence properties in the context of non-convex learning problems with the use of ANNs. Thus, we provide finite-time guarantees for TUSLA to find approximate minimizers of both empirical and population risks. The roots of the TUSLA algorithm are based on the taming technology for diffusion processes with superlinear coefficients as developed in \citet{tamed-euler, SabanisAoAP} and for MCMC algorithms in \citet{tula}. Numerical experiments are presented which confirm the theoretical findings and illustrate the need for the use of the new algorithm in comparison to vanilla SGLD within the framework of ANNs.  ( 2 min )
    Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning. (arXiv:1909.05850v6 [stat.ML] UPDATED)
    Off-policy evaluation (OPE) in reinforcement learning is notoriously difficult in long- and infinite-horizon settings due to diminishing overlap between behavior and target policies. In this paper, we study the role of Markovian and time-invariant structure in efficient OPE. We first derive the efficiency bounds for OPE when one assumes each of these structures. This precisely characterizes the curse of horizon: in time-variant processes, OPE is only feasible in the near-on-policy setting, where behavior and target policies are sufficiently similar. But, in time-invariant Markov decision processes, our bounds show that truly-off-policy evaluation is feasible, even with only just one dependent trajectory, and provide the limits of how well we could hope to do. We develop a new estimator based on Double Reinforcement Learning (DRL) that leverages this structure for OPE using the efficient influence function we derive. Our DRL estimator simultaneously uses estimated stationary density ratios and $q$-functions and remains efficient when both are estimated at slow, nonparametric rates and remains consistent when either is estimated consistently. We investigate these properties and the performance benefits of leveraging the problem structure for more efficient OPE.  ( 2 min )
    Decentralized Exploration in Multi-Armed Bandits -- Extended version. (arXiv:1811.07763v6 [cs.LG] UPDATED)
    We consider the decentralized exploration problem: a set of players collaborate to identify the best arm by asynchronously interacting with the same stochastic environment. The objective is to insure privacy in the best arm identification problem between asynchronous, collaborative, and thrifty players. In the context of a digital service, we advocate that this decentralized approach allows a good balance between the interests of users and those of service providers: the providers optimize their services, while protecting the privacy of the users and saving resources. We define the privacy level as the amount of information an adversary could infer by intercepting the messages concerning a single user. We provide a generic algorithm Decentralized Elimination, which uses any best arm identification algorithm as a subroutine. We prove that this algorithm insures privacy, with a low communication cost, and that in comparison to the lower bound of the best arm identification problem, its sample complexity suffers from a penalty depending on the inverse of the probability of the most frequent players. Then, thanks to the genericity of the approach, we extend the proposed algorithm to the non-stationary bandits. Finally, experiments illustrate and complete the analysis.  ( 2 min )
    Martingale Methods for Sequential Estimation of Convex Functionals and Divergences. (arXiv:2103.09267v3 [math.ST] UPDATED)
    We present a unified technique for sequential estimation of convex divergences between distributions, including integral probability metrics like the kernel maximum mean discrepancy, $\varphi$-divergences like the Kullback-Leibler divergence, and optimal transport costs, such as powers of Wasserstein distances. This is achieved by observing that empirical convex divergences are (partially ordered) reverse submartingales with respect to the exchangeable filtration, coupled with maximal inequalities for such processes. These techniques appear to be complementary and powerful additions to the existing literature on both confidence sequences and convex divergences. We construct an offline-to-sequential device that converts a wide array of existing offline concentration inequalities into time-uniform confidence sequences that can be continuously monitored, providing valid tests or confidence intervals at arbitrary stopping times. The resulting sequential bounds pay only an iterated logarithmic price over the corresponding fixed-time bounds, retaining the same dependence on problem parameters (like dimension or alphabet size if applicable). These results are also applicable to more general convex functionals, like the negative differential entropy, suprema of empirical processes, and V-Statistics.  ( 2 min )
    Robust Max Entrywise Error Bounds for Tensor Estimation from Sparse Observations via Similarity Based Collaborative Filtering. (arXiv:1908.01241v4 [cs.LG] UPDATED)
    Consider the task of estimating a 3-order $n \times n \times n$ tensor from noisy observations of randomly chosen entries in the sparse regime. We introduce a similarity based collaborative filtering algorithm for estimating a tensor from sparse observations and argue that it achieves sample complexity that nearly matches the conjectured computationally efficient lower bound on the sample complexity for the setting of low-rank tensors. Our algorithm uses the matrix obtained from the flattened tensor to compute similarity, and estimates the tensor entries using a nearest neighbor estimator. We prove that the algorithm recovers a finite rank tensor with maximum entry-wise error (MEE) and mean-squared-error (MSE) decaying to $0$ as long as each entry is observed independently with probability $p = \Omega(n^{-3/2 + \kappa})$ for any arbitrarily small $\kappa > 0$. More generally, we establish robustness of the estimator, showing that when arbitrary noise bounded by $\varepsilon \geq 0$ is added to each observation, the estimation error with respect to MEE and MSE degrades by $\text{poly}(\varepsilon)$. Consequently, even if the tensor may not have finite rank but can be approximated within $\varepsilon \geq 0$ by a finite rank tensor, then the estimation error converges to $\text{poly}(\varepsilon)$. Our analysis sheds insight into the conjectured sample complexity lower bound, showing that it matches the connectivity threshold of the graph used by our algorithm for estimating similarity between coordinates.  ( 2 min )
    Efficient and Accurate Estimation of Lipschitz Constants for Deep Neural Networks. (arXiv:1906.04893v2 [cs.LG] UPDATED)
    Tight estimation of the Lipschitz constant for deep neural networks (DNNs) is useful in many applications ranging from robustness certification of classifiers to stability analysis of closed-loop systems with reinforcement learning controllers. Existing methods in the literature for estimating the Lipschitz constant suffer from either lack of accuracy or poor scalability. In this paper, we present a convex optimization framework to compute guaranteed upper bounds on the Lipschitz constant of DNNs both accurately and efficiently. Our main idea is to interpret activation functions as gradients of convex potential functions. Hence, they satisfy certain properties that can be described by quadratic constraints. This particular description allows us to pose the Lipschitz constant estimation problem as a semidefinite program (SDP). The resulting SDP can be adapted to increase either the estimation accuracy (by capturing the interaction between activation functions of different layers) or scalability (by decomposition and parallel implementation). We illustrate the utility of our approach with a variety of experiments on randomly generated networks and on classifiers trained on the MNIST and Iris datasets. In particular, we experimentally demonstrate that our Lipschitz bounds are the most accurate compared to those in the literature. We also study the impact of adversarial training methods on the Lipschitz bounds of the resulting classifiers and show that our bounds can be used to efficiently provide robustness guarantees.  ( 2 min )
    On the role of Model Uncertainties in Bayesian Optimization. (arXiv:2301.05983v1 [stat.ML])
    Bayesian optimization (BO) is a popular method for black-box optimization, which relies on uncertainty as part of its decision-making process when deciding which experiment to perform next. However, not much work has addressed the effect of uncertainty on the performance of the BO algorithm and to what extent calibrated uncertainties improve the ability to find the global optimum. In this work, we provide an extensive study of the relationship between the BO performance (regret) and uncertainty calibration for popular surrogate models and compare them across both synthetic and real-world experiments. Our results confirm that Gaussian Processes are strong surrogate models and that they tend to outperform other popular models. Our results further show a positive association between calibration error and regret, but interestingly, this association disappears when we control for the type of model in the analysis. We also studied the effect of re-calibration and demonstrate that it generally does not lead to improved regret. Finally, we provide theoretical justification for why uncertainty calibration might be difficult to combine with BO due to the small sample sizes commonly used.  ( 2 min )
    Transformers as Algorithms: Generalization and Implicit Model Selection in In-context Learning. (arXiv:2301.07067v1 [cs.LG])
    In-context learning (ICL) is a type of prompting where a transformer model operates on a sequence of (input, output) examples and performs inference on-the-fly. This implicit training is in contrast to explicitly tuning the model weights based on examples. In this work, we formalize in-context learning as an algorithm learning problem, treating the transformer model as a learning algorithm that can be specialized via training to implement-at inference-time-another target algorithm. We first explore the statistical aspects of this abstraction through the lens of multitask learning: We obtain generalization bounds for ICL when the input prompt is (1) a sequence of i.i.d. (input, label) pairs or (2) a trajectory arising from a dynamical system. The crux of our analysis is relating the excess risk to the stability of the algorithm implemented by the transformer, which holds under mild assumptions. Secondly, we use our abstraction to show that transformers can act as an adaptive learning algorithm and perform model selection across different hypothesis classes. We provide numerical evaluations that (1) demonstrate transformers can indeed implement near-optimal algorithms on classical regression problems with i.i.d. and dynamic data, (2) identify an inductive bias phenomenon where the transfer risk on unseen tasks is independent of the transformer complexity, and (3) empirically verify our theoretical predictions.  ( 2 min )
    A Fast Algorithm for Adaptive Private Mean Estimation. (arXiv:2301.07078v1 [stat.ML])
    We design an $(\varepsilon, \delta)$-differentially private algorithm to estimate the mean of a $d$-variate distribution, with unknown covariance $\Sigma$, that is adaptive to $\Sigma$. To within polylogarithmic factors, the estimator achieves optimal rates of convergence with respect to the induced Mahalanobis norm $||\cdot||_\Sigma$, takes time $\tilde{O}(n d^2)$ to compute, has near linear sample complexity for sub-Gaussian distributions, allows $\Sigma$ to be degenerate or low rank, and adaptively extends beyond sub-Gaussianity. Prior to this work, other methods required exponential computation time or the superlinear scaling $n = \Omega(d^{3/2})$ to achieve non-trivial error with respect to the norm $||\cdot||_\Sigma$.  ( 2 min )
    MAFUS: a Framework to predict mortality risk in MAFLD subjects. (arXiv:2301.06908v1 [stat.ML])
    Metabolic (dysfunction) associated fatty liver disease (MAFLD) establishes new criteria for diagnosing fatty liver disease independent of alcohol consumption and concurrent viral hepatitis infection. However, the long-term outcome of MAFLD subjects is sparse. Few articles are focused on mortality in MAFLD subjects, and none investigate how to predict a fatal outcome. In this paper, we propose an artificial intelligence-based framework named MAFUS that physicians can use for predicting mortality in MAFLD subjects. The framework uses data from various anthropometric and biochemical sources based on Machine Learning (ML) algorithms. The framework has been tested on a state-of-the-art dataset on which five ML algorithms are trained. Support Vector Machines resulted in being the best model. Furthermore, an Explainable Artificial Intelligence (XAI) analysis has been performed to understand the SVM diagnostic reasoning and the contribution of each feature to the prediction. The MAFUS framework is easy to apply, and the required parameters are readily available in the dataset.  ( 2 min )
    Enhancing Deep Traffic Forecasting Models with Dynamic Regression. (arXiv:2301.06650v1 [cs.LG])
    A common assumption in deep learning-based multivariate and multistep traffic time series forecasting models is that residuals are independent, isotropic, and uncorrelated in space and time. While this assumption provides a straightforward loss function (such as MAE/MSE), it is inevitable that residual processes will exhibit strong autocorrelation and structured spatiotemporal correlation. In this paper, we propose a complementary dynamic regression (DR) framework to enhance existing deep multistep traffic forecasting frameworks through structured specifications and learning for the residual process. Specifically, we assume the residuals of the base model (i.e., a well-developed traffic forecasting model) are governed by a matrix-variate seasonal autoregressive (AR) model, which can be seamlessly integrated into the training process by redesigning the overall loss function. Parameters in the DR framework can be jointly learned with the base model. We evaluate the effectiveness of the proposed framework in enhancing several state-of-the-art deep traffic forecasting models on both speed and flow datasets. Our experiment results show that the DR framework not only improves existing traffic forecasting models but also offers interpretable regression coefficients and spatiotemporal covariance matrices.  ( 2 min )
    Neural Operator Framework for Digital Twin and Complex Engineering Systems. (arXiv:2301.06701v1 [cs.LG])
    With modern computational advancements and statistical analysis methods, machine learning algorithms have become a vital part of engineering modeling. Neural Operator Networks (ONets) is an emerging machine learning algorithm as a "faster surrogate" for approximating solutions to partial differential equations (PDEs) due to their ability to approximate mathematical operators versus the direct approximation of Neural Networks (NN). ONets use the Universal Approximation Theorem to map finite-dimensional inputs to infinite-dimensional space using the branch-trunk architecture, which encodes domain and feature information separately before using a dot product to combine the information. ONets are expected to occupy a vital niche for surrogate modeling in physical systems and Digital Twin (DT) development. Three test cases are evaluated using ONets for operator approximation, including a 1-dimensional ordinary differential equations (ODE), general diffusion system, and convection-diffusion (Burger) system. Solutions for ODE and diffusion systems yield accurate and reliable results (R2>0.95), while solutions for Burger systems need further refinement in the ONet algorithm.  ( 2 min )
    $Ae^2I$: A Double Autoencoder for Imputation of Missing Values. (arXiv:2301.06633v1 [cs.LG])
    The most common strategy of imputing missing values in a table is to study either the column-column relationship or the row-row relationship of the data table, then use the relationship to impute the missing values based on the non-missing values from other columns of the same row, or from the other rows of the same column. This paper introduces a double autoencoder for imputation ($Ae^2I$) that simultaneously and collaboratively uses both row-row relationship and column-column relationship to impute the missing values. Empirical tests on Movielens 1M dataset demonstrated that $Ae^2I$ outperforms the current state-of-the-art models for recommender systems by a significant margin.  ( 2 min )
    Deep Conditional Measure Quantization. (arXiv:2301.06907v1 [stat.ML])
    The quantization of a (probability) measure is replacing it by a sum of Dirac masses that is close enough to it (in some metric space of probability measures). Various methods exists to do so, but the situation of quantizing a conditional law has been less explored. We propose a method, called DCMQ, involving a Huber-energy kernel-based approach coupled with a deep neural network architecture. The method is tested on several examples and obtains promising results.  ( 2 min )
    Kernel-based off-policy estimation without overlap: Instance optimality beyond semiparametric efficiency. (arXiv:2301.06240v1 [math.ST])
    We study optimal procedures for estimating a linear functional based on observational data. In many problems of this kind, a widely used assumption is strict overlap, i.e., uniform boundedness of the importance ratio, which measures how well the observational data covers the directions of interest. When it is violated, the classical semi-parametric efficiency bound can easily become infinite, so that the instance-optimal risk depends on the function class used to model the regression function. For any convex and symmetric function class $\mathcal{F}$, we derive a non-asymptotic local minimax bound on the mean-squared error in estimating a broad class of linear functionals. This lower bound refines the classical semi-parametric one, and makes connections to moduli of continuity in functional estimation. When $\mathcal{F}$ is a reproducing kernel Hilbert space, we prove that this lower bound can be achieved up to a constant factor by analyzing a computationally simple regression estimator. We apply our general results to various families of examples, thereby uncovering a spectrum of rates that interpolate between the classical theories of semi-parametric efficiency (with $\sqrt{n}$-consistency) and the slower minimax rates associated with non-parametric function estimation.  ( 2 min )
    Theoretical and computational aspects of robust optimal transportation, with applications to statistics and machine learning. (arXiv:2301.06297v1 [math.ST])
    Optimal transport (OT) theory and the related $p$-Wasserstein distance ($W_p$, $p\geq 1$) are popular tools in statistics and machine learning. Recent studies have been remarking that inference based on OT and on $W_p$ is sensitive to outliers. To cope with this issue, we work on a robust version of the primal OT problem (ROBOT) and show that it defines a robust version of $W_1$, called robust Wasserstein distance, which is able to downweight the impact of outliers. We study properties of this novel distance and use it to define minimum distance estimators. Our novel estimators do not impose any moment restrictions: this allows us to extend the use of OT methods to inference on heavy-tailed distributions. We also provide statistical guarantees of the proposed estimators. Moreover, we derive the dual form of the ROBOT and illustrate its applicability to machine learning. Numerical exercises (see also the supplementary material) provide evidence of the benefits yielded by our methods.  ( 2 min )
    Doubly Robust Counterfactual Classification. (arXiv:2301.06199v1 [cs.LG])
    We study counterfactual classification as a new tool for decision-making under hypothetical (contrary to fact) scenarios. We propose a doubly-robust nonparametric estimator for a general counterfactual classifier, where we can incorporate flexible constraints by casting the classification problem as a nonlinear mathematical program involving counterfactuals. We go on to analyze the rates of convergence of the estimator and provide a closed-form expression for its asymptotic distribution. Our analysis shows that the proposed estimator is robust against nuisance model misspecification, and can attain fast $\sqrt{n}$ rates with tractable inference even when using nonparametric machine learning approaches. We study the empirical performance of our methods by simulation and apply them for recidivism risk prediction.  ( 2 min )
    Data-aware customization of activation functions reduces neural network error. (arXiv:2301.06635v1 [cs.LG])
    Activation functions play critical roles in neural networks, yet current off-the-shelf neural networks pay little attention to the specific choice of activation functions used. Here we show that data-aware customization of activation functions can result in striking reductions in neural network error. We first give a simple linear algebraic explanation of the role of activation functions in neural networks; then, through connection with the Diaconis-Shahshahani Approximation Theorem, we propose a set of criteria for good activation functions. As a case study, we consider regression tasks with a partially exchangeable target function, \emph{i.e.} $f(u,v,w)=f(v,u,w)$ for $u,v\in \mathbb{R}^d$ and $w\in \mathbb{R}^k$, and prove that for such a target function, using an even activation function in at least one of the layers guarantees that the prediction preserves partial exchangeability for best performance. Since even activation functions are seldom used in practice, we designed the ``seagull'' even activation function $\log(1+x^2)$ according to our criteria. Empirical testing on over two dozen 9-25 dimensional examples with different local smoothness, curvature, and degree of exchangeability revealed that a simple substitution with the ``seagull'' activation function in an already-refined neural network can lead to an order-of-magnitude reduction in error. This improvement was most pronounced when the activation function substitution was applied to the layer in which the exchangeable variables are connected for the first time. While the improvement is greatest for low-dimensional data, experiments on the CIFAR10 image classification dataset showed that use of ``seagull'' can reduce error even for high-dimensional cases. These results collectively highlight the potential of customizing activation functions as a general approach to improve neural network performance.  ( 2 min )
    Asymptotic normality and optimality in nonsmooth stochastic approximation. (arXiv:2301.06632v1 [math.OC])
    In their seminal work, Polyak and Juditsky showed that stochastic approximation algorithms for solving smooth equations enjoy a central limit theorem. Moreover, it has since been argued that the asymptotic covariance of the method is best possible among any estimation procedure in a local minimax sense of H\'{a}jek and Le Cam. A long-standing open question in this line of work is whether similar guarantees hold for important non-smooth problems, such as stochastic nonlinear programming or stochastic variational inequalities. In this work, we show that this is indeed the case.  ( 2 min )
    Geometric ergodicity of SGLD via reflection coupling. (arXiv:2301.06769v1 [math.PR])
    We consider the geometric ergodicity of the Stochastic Gradient Langevin Dynamics (SGLD) algorithm under nonconvexity settings. Via the technique of reflection coupling, we prove the Wasserstein contraction of SGLD when the target distribution is log-concave only outside some compact set. The time discretization and the minibatch in SGLD introduce several difficulties when applying the reflection coupling, which are addressed by a series of careful estimates of conditional expectations. As a direct corollary, the SGLD with constant step size has an invariant distribution and we are able to obtain its geometric ergodicity in terms of $W_1$ distance. The generalization to non-gradient drifts is also included.  ( 2 min )
    Case-Base Neural Networks: survival analysis with time-varying, higher-order interactions. (arXiv:2301.06535v1 [stat.ML])
    Neural network-based survival methods can model data-driven covariate interactions. While these methods have led to better predictive performance than regression-based approaches, they cannot model both time-varying interactions and complex baseline hazards. To address this, we propose Case-Base Neural Networks (CBNN) as a new approach that combines the case-base sampling framework with flexible architectures. Our method naturally accounts for censoring and does not require method specific hyperparameters. Using a novel sampling scheme and data augmentation, we incorporate time directly into a feed-forward neural network. CBNN predicts the probability of an event occurring at a given moment and estimates the hazard function. We compare the performance of CBNN to survival methods based on regression and neural networks in two simulations and two real data applications. We report two time-dependent metrics for each model. In the simulations and real data applications, CBNN provides a more consistent predictive performance across time and outperforms the competing neural network approaches. For a simple simulation with an exponential hazard model, CBNN outperforms the other neural network methods. For a complex simulation, which highlights the ability of CBNN to model both a complex baseline hazard and time-varying interactions, CBNN outperforms all competitors. The first real data application shows CBNN outperforming all neural network competitors, while a second real data application shows competitive performance. We highlight the benefit of combining case-base sampling with deep learning to provide a simple and flexible modeling framework for data-driven, time-varying interaction modeling of survival outcomes. An R package is available at https://github.com/Jesse-Islam/cbnn.  ( 2 min )
    Expected Gradients of Maxout Networks and Consequences to Parameter Initialization. (arXiv:2301.06956v1 [stat.ML])
    We study the gradients of a maxout network with respect to inputs and parameters and obtain bounds for the moments depending on the architecture and the parameter distribution. We observe that the distribution of the input-output Jacobian depends on the input, which complicates a stable parameter initialization. Based on the moments of the gradients, we formulate parameter initialization strategies that avoid vanishing and exploding gradients in wide networks. Experiments with deep fully-connected and convolutional networks show that this strategy improves SGD and Adam training of deep maxout networks. In addition, we obtain refined bounds on the expected number of linear regions, results on the expected curve length distortion, and results on the NTK.  ( 2 min )
    Optimal Algorithms for Latent Bandits with Cluster Structure. (arXiv:2301.07040v1 [cs.LG])
    We consider the problem of latent bandits with cluster structure where there are multiple users, each with an associated multi-armed bandit problem. These users are grouped into \emph{latent} clusters such that the mean reward vectors of users within the same cluster are identical. At each round, a user, selected uniformly at random, pulls an arm and observes a corresponding noisy reward. The goal of the users is to maximize their cumulative rewards. This problem is central to practical recommendation systems and has received wide attention of late \cite{gentile2014online, maillard2014latent}. Now, if each user acts independently, then they would have to explore each arm independently and a regret of $\Omega(\sqrt{\mathsf{MNT}})$ is unavoidable, where $\mathsf{M}, \mathsf{N}$ are the number of arms and users, respectively. Instead, we propose LATTICE (Latent bAndiTs via maTrIx ComplEtion) which allows exploitation of the latent cluster structure to provide the minimax optimal regret of $\widetilde{O}(\sqrt{(\mathsf{M}+\mathsf{N})\mathsf{T}})$, when the number of clusters is $\widetilde{O}(1)$. This is the first algorithm to guarantee such a strong regret bound. LATTICE is based on a careful exploitation of arm information within a cluster while simultaneously clustering users. Furthermore, it is computationally efficient and requires only $O(\log{\mathsf{T}})$ calls to an offline matrix completion oracle across all $\mathsf{T}$ rounds.  ( 2 min )
    From Risk Prediction to Risk Factors Interpretation. Comparison of Neural Networks and Classical Statistics for Dementia Prediction. (arXiv:2301.06995v1 [stat.AP])
    It is proposed to investigate the onset of a disease D, based on several risk factors., with a specific interest in Alzheimer occurrence. For that purpose, two classes of techniques are available, whose properties are quite different in terms of interpretation, which is the focus of this paper: classical statistics based on probabilistic models and artificial intelligence (mainly neural networks) based on optimization algorithms. Both methods are good at prediction, with a preference for neural networks when the dimension of the potential predictors is high. But the advantage of the classical statistics is cognitive : the role of each factor is generally summarized in the value of a coefficient which is highly positive for a harmful factor, close to 0 for an irrelevant one, and highly negative for a beneficial one.  ( 2 min )
    A Coreset Learning Reality Check. (arXiv:2301.06163v1 [cs.LG])
    Subsampling algorithms are a natural approach to reduce data size before fitting models on massive datasets. In recent years, several works have proposed methods for subsampling rows from a data matrix while maintaining relevant information for classification. While these works are supported by theory and limited experiments, to date there has not been a comprehensive evaluation of these methods. In our work, we directly compare multiple methods for logistic regression drawn from the coreset and optimal subsampling literature and discover inconsistencies in their effectiveness. In many cases, methods do not outperform simple uniform subsampling.  ( 2 min )
    GAR: Generalized Autoregression for Multi-Fidelity Fusion. (arXiv:2301.05729v1 [stat.ML])
    In many scientific research and engineering applications where repeated simulations of complex systems are conducted, a surrogate is commonly adopted to quickly estimate the whole system. To reduce the expensive cost of generating training examples, it has become a promising approach to combine the results of low-fidelity (fast but inaccurate) and high-fidelity (slow but accurate) simulations. Despite the fast developments of multi-fidelity fusion techniques, most existing methods require particular data structures and do not scale well to high-dimensional output. To resolve these issues, we generalize the classic autoregression (AR), which is wildly used due to its simplicity, robustness, accuracy, and tractability, and propose generalized autoregression (GAR) using tensor formulation and latent features. GAR can deal with arbitrary dimensional outputs and arbitrary multifidelity data structure to satisfy the demand of multi-fidelity fusion for complex problems; it admits a fully tractable likelihood and posterior requiring no approximate inference and scales well to high-dimensional problems. Furthermore, we prove the autokrigeability theorem based on GAR in the multi-fidelity case and develop CIGAR, a simplified GAR with the exact predictive mean accuracy with computation reduction by a factor of d 3, where d is the dimensionality of the output. The empirical assessment includes many canonical PDEs and real scientific examples and demonstrates that the proposed method consistently outperforms the SOTA methods with a large margin (up to 6x improvement in RMSE) with only a couple high-fidelity training samples.  ( 2 min )
    Calibrated Data-Dependent Constraints with Exact Satisfaction Guarantees. (arXiv:2301.06195v1 [stat.ML])
    We consider the task of training machine learning models with data-dependent constraints. Such constraints often arise as empirical versions of expected value constraints that enforce fairness or stability goals. We reformulate data-dependent constraints so that they are calibrated: enforcing the reformulated constraints guarantees that their expected value counterparts are satisfied with a user-prescribed probability. The resulting optimization problem is amendable to standard stochastic optimization algorithms, and we demonstrate the efficacy of our method on a fairness-sensitive classification task where we wish to guarantee the classifier's fairness (at test time).  ( 2 min )
    Tale of two c(omplex)ities. (arXiv:2301.06259v1 [math.ST])
    For decades, best subset selection (BSS) has eluded statisticians mainly due to its computational bottleneck. However, until recently, modern computational breakthroughs have rekindled theoretical interest in BSS and have led to new findings. Recently, Guo et al. (2020) showed that the model selection performance of BSS is governed by a margin quantity that is robust to the design dependence, unlike modern methods such as LASSO, SCAD, MCP, etc. Motivated by their theoretical results, in this paper, we also study the variable selection properties of best subset selection for high-dimensional sparse linear regression setup. We show that apart from the identifiability margin, the following two complexity measures play a fundamental role in characterizing the margin condition for model consistency: (a) complexity of residualized features, (b) complexity of spurious projections. In particular, we establish a simple margin condition that only depends only on the identifiability margin quantity and the dominating one of the two complexity measures. Furthermore, we show that a similar margin condition depending on similar margin quantity and complexity measures is also necessary for model consistency of BSS. For a broader understanding of the complexity measures, we also consider some simple illustrative examples to demonstrate the variation in the complexity measures which broadens our theoretical understanding of the model selection performance of BSS under different correlation structures.  ( 2 min )
    Scaling Deep Networks with the Mesh Adaptive Direct Search algorithm. (arXiv:2301.06641v1 [stat.ML])
    Deep neural networks are getting larger. Their implementation on edge and IoT devices becomes more challenging and moved the community to design lighter versions with similar performance. Standard automatic design tools such as \emph{reinforcement learning} and \emph{evolutionary computing} fundamentally rely on cheap evaluations of an objective function. In the neural network design context, this objective is the accuracy after training, which is expensive and time-consuming to evaluate. We automate the design of a light deep neural network for image classification using the \emph{Mesh Adaptive Direct Search}(MADS) algorithm, a mature derivative-free optimization method that effectively accounts for the expensive blackbox nature of the objective function to explore the design space, even in the presence of constraints.Our tests show competitive compression rates with reduced numbers of trials.  ( 2 min )
    Intrinsic Gaussian Process on Unknown Manifolds with Probabilistic Metrics. (arXiv:2301.06533v1 [stat.ML])
    This article presents a novel approach to construct Intrinsic Gaussian Processes for regression on unknown manifolds with probabilistic metrics (GPUM) in point clouds. In many real world applications, one often encounters high dimensional data (e.g. point cloud data) centred around some lower dimensional unknown manifolds. The geometry of manifold is in general different from the usual Euclidean geometry. Naively applying traditional smoothing methods such as Euclidean Gaussian Processes (GPs) to manifold valued data and so ignoring the geometry of the space can potentially lead to highly misleading predictions and inferences. A manifold embedded in a high dimensional Euclidean space can be well described by a probabilistic mapping function and the corresponding latent space. We investigate the geometrical structure of the unknown manifolds using the Bayesian Gaussian Processes latent variable models(BGPLVM) and Riemannian geometry. The distribution of the metric tensor is learned using BGPLVM. The boundary of the resulting manifold is defined based on the uncertainty quantification of the mapping. We use the the probabilistic metric tensor to simulate Brownian Motion paths on the unknown manifold. The heat kernel is estimated as the transition density of Brownian Motion and used as the covariance functions of GPUM. The applications of GPUM are illustrated in the simulation studies on the Swiss roll, high dimensional real datasets of WiFi signals and image data examples. Its performance is compared with the Graph Laplacian GP, Graph Matern GP and Euclidean GP.  ( 2 min )
    A domain-decomposed VAE method for Bayesian inverse problems. (arXiv:2301.05708v1 [stat.ML])
    Bayesian inverse problems are often computationally challenging when the forward model is governed by complex partial differential equations (PDEs). This is typically caused by expensive forward model evaluations and high-dimensional parameterization of priors. This paper proposes a domain-decomposed variational auto-encoder Markov chain Monte Carlo (DD-VAE-MCMC) method to tackle these challenges simultaneously. Through partitioning the global physical domain into small subdomains, the proposed method first constructs local deterministic generative models based on local historical data, which provide efficient local prior representations. Gaussian process models with active learning address the domain decomposition interface conditions. Then inversions are conducted on each subdomain independently in parallel and in low-dimensional latent parameter spaces. The local inference solutions are post-processed through the Poisson image blending procedure to result in an efficient global inference result. Numerical examples are provided to demonstrate the performance of the proposed method.  ( 2 min )
    Compress Then Test: Powerful Kernel Testing in Near-linear Time. (arXiv:2301.05974v1 [stat.ML])
    Kernel two-sample testing provides a powerful framework for distinguishing any pair of distributions based on $n$ sample points. However, existing kernel tests either run in $n^2$ time or sacrifice undue power to improve runtime. To address these shortcomings, we introduce Compress Then Test (CTT), a new framework for high-powered kernel testing based on sample compression. CTT cheaply approximates an expensive test by compressing each $n$ point sample into a small but provably high-fidelity coreset. For standard kernels and subexponential distributions, CTT inherits the statistical behavior of a quadratic-time test -- recovering the same optimal detection boundary -- while running in near-linear time. We couple these advances with cheaper permutation testing, justified by new power analyses; improved time-vs.-quality guarantees for low-rank approximation; and a fast aggregation procedure for identifying especially discriminating kernels. In our experiments with real and simulated data, CTT and its extensions provide 20--200x speed-ups over state-of-the-art approximate MMD tests with no loss of power.  ( 2 min )
  • Open

    DreamerV3: Mastering Diverse Domains through World Models
    Highlights  ( 4 min )

  • Open

    Is there an AI API that can relate images to text?
    I have a pretty big catalogue of digital miniatures on Notion, which all have images and tags. The proccess of tagging each entry is very laborious, so I was wondering if I could use some kind of AI service that can train on my dataset, so that I could use it to automatically fill the tags for me, and also to use it as a search engine. I know some javascript and have some experience using APIs, but I'm not a developer, so I'm looking for something that's relatively easy to use, and that I won't need to learn C# or other advanced programming languages. ​ This is a sample entry from my database submitted by /u/Rayuaz [link] [comments]  ( 43 min )
    Can anyone answer this?
    What are amazon and Google and Bing policies on AI generated text - any test cases or precedents where accounts or sites closed? submitted by /u/TheChalaK- [link] [comments]  ( 43 min )
    These boston dynamics videos just keep getting more and more concerning.
    submitted by /u/Rollyman1 [link] [comments]  ( 43 min )
    What AI can I use to make deepfakes of my voice?
    I've recently been seeing artists feeding AI their voice & having it sing for them. That's pretty cool & I'd like to try it, but all I'm finding are research papers on it and no actual open AI for me to try this. Anyone know where I can access such an AI? submitted by /u/KelkonBajam [link] [comments]  ( 50 min )
    Synthetic data is the future of AI
    https://moez-62905.medium.com/synthetic-data-is-the-future-of-artificial-intelligence-6fcfd2ce1a14 submitted by /u/repeat_or [link] [comments]  ( 46 min )
    A short story about ChatGPT3 killing the giant Google
    submitted by /u/Imagine-your-success [link] [comments]  ( 55 min )
    Generative AI and The Future of Work
    submitted by /u/_utisz_ [link] [comments]  ( 52 min )
    AI is Assisting the UN in Preventing Nuclear War
    submitted by /u/HODLTID [link] [comments]  ( 42 min )
    Anyone know a "zyral, Zyro" AI application?
    Heard it on a stream but can't seem to figure out the proper spelling of it. The streamer uses it from thumbnails. submitted by /u/madalert123 [link] [comments]  ( 43 min )
    Ryan Reyonlds Toonified using VToonify
    submitted by /u/oridnary_artist [link] [comments]  ( 42 min )
    ByteDance AI Research Proposes a Novel Self-Supervised Learning Framework to Create High-Quality Stylized 3D Avatars with a Mix of Continuous and Discrete Parameters
    submitted by /u/ai-lover [link] [comments]  ( 43 min )
    Artificial Waifus
    Some programmer on tiktok made an AI waifu (source). He wants to redo the project using a real girl's text messages. Imagine a world where women and men can sell their text histories in order to train better models of themselves to be used by others. This is just the beginning, boys. submitted by /u/crowb1rd [link] [comments]  ( 43 min )
    14 highlights from Sam Altman's interview
    From https://smokingrobot.beehiiv.com/p/sam-altman-interview-strictly-vc ​ On the unexpected progress of AI: Everyone thought at first it comes for physical labor, like working in a factory and then truck driving, then this sort of less demanding cognitive labor, and then the really demanding cognitive labor like computer programming. And then very last of all or maybe never because maybe it's like some deep human special sauce, was creativity. And of course we can look now and say it really looks like it's going to go exactly the opposite direction. On the impact on education and other changes: There are societal changes that ChatGPT is going to cause or is causing. There's I think a big one going now about the impact of this on education, academic integrity, and all of that. But star…  ( 63 min )
    New Microsoft AI can accurately mimic a human voice after analyzing a 3-second sample
    As advancements in artificial intelligence continue to unfold at a rapid pace, it is not uncommon for individuals to express concerns about the potential implications on employment opportunities for human workers. Adding fuel to these concerns is the recent announcement made by a team of researchers at Microsoft, who have developed a new AI system capable of accurately replicating a human voice using only a three-second audio sample. This breakthrough in technology highlights the potential for AI to not only automate a plethora of tasks, but also to potentially replicate human capabilities and skills with increased accuracy and efficiency. The implications of this development are significant, as it raises important questions about the future of work and the role of AI in it. Furthermore, i…  ( 48 min )
    Is there a (as complete as possible) ranking for Language Models?
    Hello AI community, as the title says I am looking for a (up-to-date) ranking list for as many LMs (BERT, RoBERTa, T5, yada yada yada) as possible with their corresponding scores in the different tasks. Is there maybe some site which is keeping track of these scores or some awesome GitHub page? Thank you for any hints! submitted by /u/Own-Technology-9815 [link] [comments]  ( 51 min )
    DeepL launches New Product ‘Write’ To Take On Grammarly
    submitted by /u/liquidocelotYT [link] [comments]  ( 44 min )
    ✨I made a story script and Vtuber like character using 100% A.I besides editing💕✨
    submitted by /u/Recent-Dealer-5844 [link] [comments]  ( 45 min )
    Top A.I. Powered Tools Not Named ChatGPT (2nd)
    submitted by /u/BackgroundResult [link] [comments]  ( 44 min )
    Move over, ChatGPT: Israeli start-up AI21 Labs' AI to cite sources
    submitted by /u/yaitz331 [link] [comments]  ( 43 min )
    FREE Midjourney Rival Using Stable Diffusion Under The Hood!
    submitted by /u/PuppetHere [link] [comments]  ( 43 min )
    Text-to-Audio Diffusion, by flavio schneider
    Text-conditional latent audio diffusion that can generate multiple minutes of music from a textual description. See link for samples. submitted by /u/Sea_Emu_4259 [link] [comments]  ( 44 min )
    How can GPT ever compete with search databases economically?
    A GPT3 query costs at least 5c, and an google search costs 0.05 cents, that's 100 times less. Perhaps GPT3 will always be a paid service, because advertising wouldn't be profitable? I'm thinking that the cost of GPT3 will be slashed by 9 and then it will stabilize at about 5 times more expensive than databases... because the current system will be slashed by 20 times and the data volume of the NLP will grow as well. Perhaps it will cost like 10 dollars per year for a subscription to an AI query app? submitted by /u/MegavirusOfDoom [link] [comments]  ( 48 min )
    Is there an AI tool that mines your gmail and organizes things - eg sorts out your purchases and tracks warranty and more?
    Is there an AI tool that mines your gmail and organizes things - eg sorts out your purchases and tracks warranty and more? submitted by /u/dreameh [link] [comments]  ( 45 min )
    Just tested You.com AI powered Chat Box
    submitted by /u/Sphagne [link] [comments]  ( 46 min )
    A photo created by an AI
    submitted by /u/NorthTs [link] [comments]  ( 43 min )
  • Open

    Making-of for Boston Dynamic's latest Atlas demo (gripping, placing, & throwing objects; jumps & flips)
    submitted by /u/gwern [link] [comments]  ( 53 min )
    Question about optimazation problem
    Hello, I am learning currently DL, but in work I have the opportunity ( if I select to do it, I lack the theoritical background) to create an AI. I am working as analog IC engineer, in RF circuits we have transformers which they match the Zout (Output impedance) of a block to the Zin of the next block. The transformer in schematic level is comprised by 2 capacitors, 2 inductors and the coupling factor, the output which we want to have a flat gain at the freq range that we want e.g. 76 -81 GHz. Currently the rf engineers work from experiance the transfomers and they start trimming to match the Zin = Zout. because we have a lot of transformers and I think this job will be better/faster to make an AI I was thinking about RL, but I lack the experiance. So I want to ask if someone has any sources/recommendation to study and do some examples with similar objective. ​ thank you in advance submitted by /u/InvokeMeWell [link] [comments]  ( 54 min )
    I got a project that focuses on marketing and it was suggested by my senior in work that I should try reading about MAB. Aside from MAB, is there any alternatives that covers the ground of RL?
    Basically, I'm a data scientist in an AU company, most of the time we can get away it with simple hypothesis testing in this sort of stuff , linear programming and probably machine learning approaches, but the team want to do more than that so we want to go bonkers and try MAB, aside from MAB what stochastic method I should read/study so I can contribute in our current project. submitted by /u/noodlepotato [link] [comments]  ( 62 min )
    PPO with Transformer or Attention Mechanism
    I am interested in testing PPO with an attention mechanism from a psychological perspective. I was wondering if someone has successfully customized the stable_baselines3 with an attention mechanism submitted by /u/partyjunk [link] [comments]  ( 55 min )
    What does it mean if your actor is converging and your critic is diverging?
    I am trying to train an agent with DDP3+SWA. The actor's loss is going up, which I believe is a good thing because the loss is a negative expected reward, but the critic's loss is also going up so it's diverging. ​ Does anyone have any ideas about what could cause this to happen? submitted by /u/rawrzapan [link] [comments]  ( 53 min )
  • Open

    [R] Summary of developments in ML in 2022
    Google Research has a blog post of advances in ML over the last year. It obviously focusses on stuff Google Research has been involved in, but from such a big research group, thats pretty much everything. Here it is It's a good way current if you don't have time to read every paper! (Note that some sections aren't yet published) submitted by /u/londons_explorer [link] [comments]  ( 57 min )
    [P] We made an image de-identifier using Stable Diffusion!
    We combined image captioning using CLIP and image generation using the Hugging Face Stable Diffusion model to create an image de-identifier modeled after the game of telephone, Imafake! All you have to do is upload an image, convert it to a caption, then convert that caption to an image with a few clicks! You can also play with the parameters of the diffusion model depending on how gnarly you want your resulting image to be. And caution, they can get rather gnarly, but that’s what makes it fun :) Thoughts and your own generated images welcome!! https://preview.redd.it/z818pd8fivca1.jpg?width=1276&format=pjpg&auto=webp&s=54ed1a70e6ba7aa52fb662d9213e46ed5f559e5a submitted by /u/Djinn_Tonic4DataSci [link] [comments]  ( 57 min )
    [P] MNIST Clock - Generating MNIST digits on the fly in your browser
    ​ MNIST CLOCK Project https://github.com/tecbar/mnist-clock Live demo https://tecbar.github.io/mnist-clock/ (it can load for a while, because it needs to download ONNX runtime) Description Hey, this is my pet project. I trained a very simple model on MNIST dataset. The task is you input a digit and it can generate output image representation of that digit. Each time it generates a little bit different digit, because of the applied noise - actually the digit vector is applied on gaussian noise. I didn't know what to do with it so I exported the model to ONNX and used ONNX web runtime to arrange the digits in a clock - so basically everything on the live demo site is running in your browser and the clock is refreshed about 20 times per second (which was arbitrary choice). The training procedure was really simple - instead of predicting a label based on an image it tries to predict an image based on a label. Here is PyTorch implemetation. It works just fine with only two linear layers. Problems The generated digits are blurry, I guess this is because I didn't use any GAN or VAE based architecture, so the model has no idea about anything basically. ​ Model ​ https://preview.redd.it/n2lsobiigvca1.png?width=438&format=png&auto=webp&s=565bda13fd4f0d5b60cb9a71828053c726ee5301 submitted by /u/tecbar [link] [comments]  ( 58 min )
    [D] [N] Book: Multimodal Deep Learning - 239 Pages! - Matthias Aßenmacher et al
    In my opinion a must read because Multimodal Deep Learning is the future! Also because papers like this: https://arxiv.org/abs/2301.03728 show that Multimodular models significantly outperform unimodular models! Book: https://arxiv.org/pdf/2301.04856.pdf Github: https://github.com/slds-lmu/seminar_multimodal_dl Abstract: This book is the result of a seminar in which we reviewed multimodal approaches and attempted to create a solid overview of the field, starting with the current state-of-the-art approaches in the two subfields of Deep Learning individually. Further, modeling frameworks are discussed where one modality is transformed into the other, as well as models in which one modality is utilized to enhance representation learning for the other. To conclude the second part, architectures with a focus on handling both modalities simultaneously are introduced. Finally, we also cover other modalities as well as general-purpose multi-modal models, which are able to handle different tasks on different modalities within one unified architecture. One interesting application (Generative Art) eventually caps off this booklet. https://preview.redd.it/vb9lycxmfvca1.jpg?width=641&format=pjpg&auto=webp&s=6b25e0d051d5cb5c3ec0117bb383e59c23c2f984 submitted by /u/Singularian2501 [link] [comments]  ( 55 min )
    [D] Automated Extraction of Building Geometry
    I need to figure out a way to automatically create a 2D one-line drawing given a point cloud of a building. I figure this is the rough workflow of that operation, but I need to define this workflow with a much higher resolution to acquire the right tools and talent for the project. Is this a suitable application for Machine Learning? If you have any insight or ideas to share that would be very much appreciated, thanks! https://preview.redd.it/4hv1ba5h0vca1.png?width=2160&format=png&auto=webp&s=f8a9e6ae45a621d72fedceeefd7a2b06599577aa submitted by /u/EducationalLayer1051 [link] [comments]  ( 53 min )
    [R] A simple explanation of Reinforcement Learning from Human Feedback (RLHF)
    ​ https://preview.redd.it/k3ims1d6zuca1.png?width=2324&format=png&auto=webp&v=enabled&s=4f4bbe410508bdd4c45f45e55dd5c1ea0fcb5fcc You must have heard about ChatGPT. Maybe you heard that it was trained with RLHF and PPO. Perhaps you do not really understand how that process works. Then check my Gist on Reinforcement Learning from Human Feedback (RLHF): https://gist.github.com/JoaoLages/c6f2dfd13d2484aa8bb0b2d567fbf093 No hard maths, straight to the point and simplified. Hope that it helps! submitted by /u/JClub [link] [comments]  ( 57 min )
    [R] Call for Papers: 2nd International Symposium on the Tsetlin Machine
    ​ CfP ISTM 2023 Calling all machine learning researchers to contribute to or participate in the 2nd International Symposium on the Tsetlin Machine @ Newcastle upon Tyne. Please consider submitting your original, high-quality research works on any emerging ML hardware, software, application, or algorithmic topics. The emerging paradigm of Tsetlin machines provides a fundamental shift from arithmetic-based to logic-based machine learning. At the core, finite-state machines, based on learning automata, learn patterns using logical clauses, and these constitute a global description of the task learnt. In this way, the Tsetlin machine introduces the concept of logical interpretable learning, where both the learned model and the process of learning are easy to follow and explain. As a result, it reduces the expertise needed to apply ML techniques efficiently in various domains. The paradigm has enabled competitive accuracy, scalability, memory footprint, inference speed, and energy consumption across diverse tasks, including classification, convolution, regression, natural language processing (NLP), and speech understanding. https://istm.no submitted by /u/olegranmo [link] [comments]  ( 60 min )
    [D] Do you know of any model capable of detecting generative model(GPT) generated text ?
    I'm looking to detect spams generated by generative models (especially gpt). But all the ones I tried fail miserably ... submitted by /u/CaptainDifferent3116 [link] [comments]  ( 62 min )
    [R] Researchers out there: which are current research directions for tree-based models?
    Hi everybody, I've been skimming this paper since yesterday and was once again impressed by the expressiveness and practicality of tree-based models. I wondered what current research directions are in the field and what novel ideas have been presented in the last years - beyond improving performances. Examples may include better explainability, online learning, splitting criteria, enhanced or customizable loss functions, adding structure or constraints, shortcomings .... submitted by /u/BenXavier [link] [comments]  ( 59 min )
    [P] AI for Materials community
    Hey everyone, working on getting started an open and collaborative community/lab at intersection of ML/AI and materials science. One big reason is because it’s a neglected area with lots of potential with generative modeling for new discoveries. A small roadmap is we want to have intro talks on the topic to ramp members up, talks from leading researchers, of course we will be training models, trying to create larger datasets, and hopefully getting access to synthesis our findings. If this sounds interesting to you checkout the website at https://ai4mlab[dot]com and consider joining! Thanks! submitted by /u/theredditbrowser1 [link] [comments]  ( 58 min )
    [R] tasksource: Structured Dataset Preprocessing Annotations for Frictionless Extreme Multi-Task Learning and Evaluation (480 tasks+ sota encoder)
    submitted by /u/Jean-Porte [link] [comments]  ( 60 min )
    [D] How much can you add/change in a camera ready conference paper?
    Fingers and toes crossed I might have a paper accepted at ICLR, and I'm wondering how much I can add/change in the camera ready version. Typos and exposition for clarity, I assume are fine to add/change. And in some cases I have seen meta-reviews say "please address comments XYZ in the camera ready version", so I assume there is a lot of leeway. But in the absence of such a comment or comments in particular about something you should change, is it okay to change/add stuff? And if so to what degree (while keeping to the page limit). submitted by /u/tfburns [link] [comments]  ( 61 min )
  • Open

    Gaining real-world industry experience through Break Through Tech AI at MIT
    A new experiential learning opportunity challenges undergraduates across the Greater Boston area to apply their AI skills to a range of industry projects.  ( 9 min )
  • Open

    Chatbots in Healthcare [Part 2]
    In April 2017 I wrote this story on the potential use of chatbots in healthcare: https://medium.com/p/984fc23e0410 . It got over 3.5K…  ( 22 min )
    The 10 Most Powerful AI Software Products in 2023
    Businesses are moving towards AI Software Products. In fact, a recent study proves this claim by saying that nine out of ten companies…  ( 18 min )
    AI Writing Tools for Creative Writing and Fiction: Unleash Your Imagination and Write Like a Pro
    No content preview
    Multicollinearity: A Guide to Understanding and Managing the Problem in Regression Models
    Multicollinearity is a common problem that might happen in multiple regression analysis, where two or more predictor variables are highly…  ( 11 min )
    Designing great AI products — Building trust
    The following post is an excerpt from my book ‘Designing Human-Centric AI Experiences’ on applied UX design for Artificial intelligence.  ( 13 min )
    The massive disruption nobody is talking about, yet.
    Bold prediction: the evolution of machine learning models (GPT, Gopher, …) combined with the ubiquity of messaging apps (WhatsApp… Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 11 min )
    Is AI going to take my Job?☹️
    Artificial intelligence (AI) and its various applications, such as ChatGPT (if you don’t know about Chat GPT, check out my post), are…  ( 11 min )
  • Open

    Simple neural networks outperform the state-of-the-art for controlling robotic prosthetics
    submitted by /u/keghn [link] [comments]  ( 67 min )
    Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet
    submitted by /u/nickb [link] [comments]  ( 68 min )
    Project advice - Deep Learning 3D models - python libraries, GPU memory management, data structuring
    I am working with 64x64x64 voxel arrays and am running into significant problems with GPU memory management. I am using TensorFlow and have an NVIDIA GeForce RTX 4080 MSI Ventus edition with 16GB of memory (purchased using research grant funding... it's sitting in a hacked together eGPU setup lol). It performs beautifully on 32x32x32 data but I can't even get started with the larger data format. I have tried limiting GPU data utilization per process, as per this post and limiting memory growth, as per this post (Ctrl+F "second option"). I have 64GB of RAM so I can fit the data into memory (even though I know that's not efficient) and was trying to put that data in a TensorFlow Dataset object, in which, according to the docs, "iteration happens in a streaming fashion, so the full dataset do…  ( 62 min )
    Two guys in London working in AI looking for volunteers to join our team in educating the public on AI
    We’re 2 Brits who work in AI. We believe AI is likely to have a huge and mostly positive impact on society but that not many people realise this or understand how it will impact everyday life. There is a lack of places online right now clearly explaining the changes AI will bring, i.e., how will AI change the experience of shopping in stores in the next 10 years or how will AI change video games in the next 10 years. We are somewhat well positioned to collate the current views on likely future changes across most areas and are in the process of starting a website and perhaps video channel which will cover how AI is likely to impact people over the next 10 years in different areas of life (movies, sports, bars, banking, schools, hospitals etc). We are looking for people to help us research, write and make videos on this cause – which we think is important to help ensure that voters don’t misunderstand AI. Alex – researches, writes, and records the audio Seb - does the video and audio editing We thought we’d put the word out and ask if anyone else would like to volunteer to help create content too. No special skills needed. Getting involved would be as easy as PMing me, hearing about how we’ve done things so far and then saying what you might be interested in helping with. Maybe thinking about ideas for topics or getting involved in research and/or article writing. We are UTC-0 but open to all. submitted by /u/TheOptimisticRogue [link] [comments]  ( 60 min )
  • Open

    Google Research, 2022 & Beyond: Language, Vision and Generative Models
    Posted by Jeff Dean, Senior Fellow and SVP of Google Research, on behalf of the Google Research community Today we kick off a series of blog posts about exciting new developments from Google Research. Please keep your eye on this space and look for the title “Google Research, 2022 & Beyond” for more articles in the series. I’ve always been interested in computers because of their ability to help people better understand the world around them. Over the last decade, much of the research done at Google has been in pursuit of a similar vision — to help people better understand the world around them and get things done. We want to build more capable machines that partner with people to accomplish a huge variety of tasks. All kinds of tasks. Complex, information-seeking tasks. Creative t…  ( 112 min )
  • Open

    Ever-Successful vs. Never-Successful: What the NFL Has to Teach Us About Managing Agile Enterprises, Part I
    A few days ago, I responded to a post on LinkedIn about how Google seems to always find a way to keep ahead of the pack, even when someone of importance leaves the company.  It occurred to me that NFL teams have to adapt and remake themselves from season to season, as players and coaches… Read More »Ever-Successful vs. Never-Successful: What the NFL Has to Teach Us About Managing Agile Enterprises, Part I The post Ever-Successful vs. Never-Successful: What the NFL Has to Teach Us About Managing Agile Enterprises, Part I appeared first on Data Science Central.  ( 22 min )
    A Practical Guide to Using Computer Vision for Business Growth
    Isn’t it fascinating how our brain processes the vast amounts of information that we receive throughout the day? Our sensory organs convert information into stimuli as they process the information they receive. Complex processes like recognizing and detecting objects require only a split second of the brain’s attention. A computer can replicate human vision using… Read More »A Practical Guide to Using Computer Vision for Business Growth The post A Practical Guide to Using Computer Vision for Business Growth appeared first on Data Science Central.  ( 24 min )
  • Open

    Sequoia Capital’s Pat Grady and Sonya Huang on Generative AI
    For insights into the future of generative AI, check out the latest episode of the NVIDIA AI Podcast. Host Noah Kravitz is joined by Pat Grady and Sonya Huang, partners at Sequoia Capital, to discuss their recent essay, “Generative AI: A Creative New World.” The authors delve into the potential of generative AI to enable Read article >  ( 4 min )
    Roll Model: Smart Stroller Pushes Its Way to the Top at CES 2023
    As any new mom or dad can tell you, parenting can be a challenge — packed with big worries and small hassles. But it may be about to get a little bit easier thanks to Glüxkind Technologies and their smart stroller, Ella. The company has just been named a CES 2023 Innovation Awards Honoree for Read article >  ( 6 min )
    Artist Zhelong Xu Brings Chinese Zodiac to Life for Lunar New Year This Week ‘In the NVIDIA Studio’
    To celebrate the upcoming Lunar New Year holiday, NVIDIA artist Zhelong Xu, aka Uncle Light, brought Chinese zodiac signs to life this week In the NVIDIA Studio — modernizing the ancient mythology in his signature style.  ( 7 min )
  • Open

    Converting between barycentric and trilinear coordinates
    Barycentric coordinates describe the position of a point relative to the three vertices of a triangle. Trilinear coordinates describe the position of a point relative to the three sides of a triangle. It’s surprisingly simple to convert from one to the other. Why should this be surprising? Because the distance from a point to a […] Converting between barycentric and trilinear coordinates first appeared on John D. Cook.  ( 5 min )

  • Open

    [D] Sport outcome predictions
    Hi all, I'm wondering why predicting outcomes of sport events like football or horse racing hasn't been achieved with machine learning tools? I guess historical data is abundant for back testing. What is so challenging with this problem? submitted by /u/proudm0 [link] [comments]  ( 56 min )
    [P] Image classification
    Hi guys, I am currently working on a DL problem where I have to classify an image dataset from Kaggle into 5 classes. The first task is to train a NN from scratch that overfits the data and then I have to modify the training process so that the network is trained without overfitting for more than double the number of epochs in the first task, keeping the same architecture, number of training images, optimizer, batch size and learnnig rate I used . I am allowed to use any architecture (resnet, alexnet, moblinet etc) or a custom model. As of now, I have tried to use resnet18 and the model overfits the data. For the second task, I apply data augmentation techniques to the training set, but the model still overfits and I am not able to find any solution. One thing I noticed, is that the validation loss at the first epochs is way lower than the training loss and it is saturating. I also tried to use a mobilenet but it stil overfits no matter how many or what augmentations I use. Can anyone recommend a solution ? submitted by /u/grisp98 [link] [comments]  ( 68 min )
    [D] Why aren't we all using linear transformers?
    There's a bunch of them - Linformer, Longformer, Performer, Nystromformer, Big Bird, etc etc. Plus a bunch more that have similar goals but don't necessarily aim for linear complexity, like memory-augmented transformers. As far as I know, none of them have really seen much use. Even for image problems, which have very long input sizes, people are using regular transformers with tokenization schemes. Am I wrong? Are they actually good, or are at least some of them better than regular transformers? If not, what's wrong with them? Do they have lower accuracy? Are they slower to train? submitted by /u/currentscurrents [link] [comments]  ( 56 min )
    [D] RLHF - What type of rewards to use?
    Hey everyone, just saw the great presentation of Nathan Lambert on Reinforcement Learning from Human Feedback and wanted to try to do some RLHF on my language model.To do this, first I need to create an experience where I collect reward scores to train the reward model. My question is: what rewards work best? Simply 👍/👎? A scale of 1-5? Ranking 4 different model outputs? There are a lot of options and I don't know which one to choose. submitted by /u/JClub [link] [comments]  ( 52 min )
    [P] Need advice on inventory planning for my capstone project.
    Hello everyone. I'm currently doing a capstone project at my university. In this project I'm working with a fashion company to address their inventory issue. They have big problems on pin pointing demand for specific products, so often end up over buying their inventory. My capstone instructor suggested we do a cluster analysis to see which products have similar demands. I'm also posting here to see what approach you guys would take to address this inventory issue submitted by /u/dingdong1882 [link] [comments]  ( 52 min )
    [R] Forcing GPT-N To Be Honest Without Supervision
    In his paper Discovering Latent Knowledge In Language Models (previous discussion), Collin Burns explains how you can train a probe on the hidden states of a language model that would classify if the model thinks an input his true or false, without access to ground truth labels. In a recent interview, he discusses high-level arguments for why this approach might work at scale on making GPT-N honest. He also talks more generally about his approach to doing research. submitted by /u/MuskFeynman [link] [comments]  ( 56 min )
    [R]: 15-step framework to analyze your chatbot and designate improvement steps
    Are you sure your Conversational AI solution is on the right path? Our chatbot evaluation metrics pinpoint if your solution leveraging the best of the industry’s leading practices, meeting user expectations, and fully taking advantage of the available technology to ensure frictionless and efficient experiences. https://masterofcode.com/chatbot-analysis-framework submitted by /u/Marinuch [link] [comments]  ( 57 min )
    [P] RWKV 14B Language Model & ChatRWKV : pure RNN (attention-free), scalable and parallelizable like Transformers
    Hi everyone. I am training my RWKV 14B ( https://github.com/BlinkDL/RWKV-LM ) on the Pile (332B tokens) and it is getting closer to GPT-NeoX 20B level. You can already try the latest checkpoint. https://preview.redd.it/7ycdftmjvmca1.png?width=1174&format=png&auto=webp&s=860a41193f1a254299d48a173756ecd66ccbc75b RWKV is a RNN that also works as a linear transformer (or we may say it's a linear transformer that also works as a RNN). So it has both parallel & serial mode, and you get the best of both worlds (fast and saves VRAM). At this moment, RWKV might be the only pure RNN that scales like usual transformers for language modeling, without using any QKV attention. It's great at preserving long context (unlike LSTM). Moreover, you get smooth spike-free carefree training experience (bf16 & Adam): https://preview.redd.it/0g3lrg6mvmca1.png?width=871&format=png&auto=webp&s=b4de1af4831ec359079cf99c41df8aa9591d48b0 As a proof of concept, I present ChatRWKV ( https://github.com/BlinkDL/ChatRWKV ). It's not instruct-tuned yet, and there are few conversations in the Pile, so don't expect great quality. But it's already fun. Chat examples (using slightly earlier checkpoints): https://preview.redd.it/zyqni6bpvmca1.png?width=1084&format=png&auto=webp&s=038fd2eab524c36d8aa2a8720a2caa3eb420df5b https://preview.redd.it/xhje4j7qvmca1.png?width=1200&format=png&auto=webp&s=7e8597d2370f9f87230560dac7f5439520384dd9 And you can chat with the bot (or try free generation) in RWKV Discord (link in Github readme: https://github.com/BlinkDL/RWKV-LM ). This is an open source project and let's build together. submitted by /u/bo_peng [link] [comments]  ( 65 min )
    [D] Unlocking the Potential of ChatGPT: A Community Discussion
    OpenAI's announcement of the release of the ChatGPT API has many of us excited about the potential applications and implications of this powerful language model. It has the ability to revolutionize the way we interact with technology and solve a wide range of problems. As a community, let's discuss the possibilities. What are some unique and innovative ways ChatGPT could be utilized? Are there any particular industries or markets that you think could benefit from the integration of ChatGPT? Let's share our thoughts and ideas, and explore the potential of this technology together. It's always exciting to see how advancements in AI can improve our world. ​ This post was written by ChatGPT submitted by /u/North-Ad6756 [link] [comments]  ( 56 min )
    [D] Are there any results on convergence guarantees when optimizing NNs?
    Given a function in some space, I have literature results that say, the function can theoretically be approximated by a Neural Network of such complexity with so many layers, of such width, with this specific given activation function. OK, so theoretically, there is a set of weights and biases that will result in a pretty good approximation of my function. Now the question is, how do I know that given an optimization method, for example stochastic gradient descent, I will actually reach this minimum or near enough to it, in so many training steps, or even at all? I attended a talk last year in which one speaker claimed that due to the way stochastic gradient descent works, it could be that some minimums are never reachable from some initialization states no matter how long one trains. Unfortunately I cannot find what paper/theorem he was referring to. I am interested in results related to this question. submitted by /u/Dartagnjan [link] [comments]  ( 57 min )
    [N] Getty Images is suing the creators of AI art tool Stable Diffusion for scraping its content
    From the article: Getty Images is suing Stability AI, creators of popular AI art tool Stable Diffusion, over alleged copyright violation. In a press statement shared with The Verge, the stock photo company said it believes that Stability AI “unlawfully copied and processed millions of images protected by copyright” to train its software and that Getty Images has “commenced legal proceedings in the High Court of Justice in London” against the firm. submitted by /u/Wiskkey [link] [comments]  ( 63 min )
    [D] I made a comprehensive comparison of YOLO(N+1) vs YOLO(N)
    The faster the video - the better Yolo is! https://www.linkedin.com/posts/maltsevanton_basically-any-yolon1-vs-yolon-comparison-activity-7021021466506768384-y7Se?utm_source=share&utm_medium=member_desktop submitted by /u/Wormkeeper [link] [comments]  ( 59 min )
    [D] Is it possible to update random forest parameters with new data instead of retraining on all data?
    I'm building some random forest models in sklearn using a dataset that updates daily. I want to take advantange of the new stream of data which could indicate changes in the X-y relationship, however I've also found that my model performs better with more data. The problem is that it takes a seriously long time to run (dataset is around 250000 rows and 50 features). Is there an approach where one builds the model at the beginning of the data stream, and then updates the parameters with new data as it arrives, instead of continuously retraining the model on the entire dataset for every day? Many thanks! submitted by /u/monkeysingmonkeynew [link] [comments]  ( 56 min )
    [P] featureimpact: A Python package for estimating the impact of features on ML models
    I made this little python package a while ago but realized I never shared it here. Maybe it's useful to you: https://github.com/bloomen/featureimpact submitted by /u/cblume [link] [comments]  ( 56 min )
    [D] Study to be specialized or generalized DS/MLE for freelancing jobs?
    Study to be specialized or generalized DS/MLE for freelancing jobs? Hello. I'm MLE (Machine Learning Engineer) and I'm currently thinking to do ML freelancing jobs (or gigs) in the future. One idea I had is to just focus on studying recommendation systems ( that is, be a specialized data scientist) or try to study and solve every type of ML problems (time series, NLP, etc). What do you think? submitted by /u/Waste_Necessary654 [link] [comments]  ( 59 min )
    [D] ModuleNotFoundError: No module named 'fbprophet'
    I'm having this problem while trying to import the auto_ts library, any idea on how to fix this? submitted by /u/PowerfulGuidance8378 [link] [comments]  ( 55 min )
  • Open

    Join us today at 11pm EST for this week's (free) seminar session of the 9-part series on Neural Networks Architectures by Pablo Duboue!
    Happening tonight at 11 pm EST on the Learn AI Together Discord server. This week's seminar session is about Popular Network Architectures. More precisely, Pablo will present... Multi-task learning. Siamese Networks. Generative Adversarial Networks (GAN). Style Transfer. Disentangled Representation Learning. Rich Caruana (1997). “Multitask learning”. In: Machine learning 28.1, pp. 41–75 Ting Gong et al. (Sept. 2019). “A Comparison of Loss Weighting Strategies for Multi task Learning in Deep Neural Networks”. In: IEEE Access PP, pp. 1–1. DOI : 10.1109/ACCESS.2019.2943604 Jane Bromley et al. (1993). “Signature verification using a "siamese" time delay neural network”. In: Advances in neural information processing systems 6 Ian Goodfellow, Jean Pouget-Abadie, et al. (2014). “Generative Adversarial Nets”. In: Advances in Neural Information Processing Systems. Ed. by Z. Ghahramani et al. Vol. 27. Curran Associates, Inc. Xi Chen et al. (2016). “Infogan: Interpretable representation learning by information maximizing generative adversarial nets”. In: Advances in neural information processing systems 29 Leon A Gatys, Alexander S Ecker, and Matthias Bethge (2016). “Image style transfer using convolutional neural networks”. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2414–2423 Sounds interesting? Join our Discord community to attend the event and future ones: https://discord.gg/c6kbhNdmmA?event=1062742110295572500 submitted by /u/OnlyProggingForFun [link] [comments]  ( 52 min )
    Sugandha Sharma, MIT: On biologically inspired neural architectures, how memories can be implemented, and control theory
    Here is a podcast episode with Sugandha Sharma from MIT where we discuss how memories can be implemented, control theory, and much more! submitted by /u/thejashGI [link] [comments]  ( 44 min )
    Why Falling in Love with AI is a Dangerous Illusion — The Limitations and Harms of Artificial…
    submitted by /u/SupPandaHugger [link] [comments]  ( 48 min )
    AI to analyze data/spreadsheets, any thoughts?
    I came across this AI chatbot recently, where you can ask quantitative/qualitative questions about your data/spreadsheet in English. It felt like if ChatGPT and Excel had a baby LOL It worked for my qualitative survey data -- shocking... Do you know any other ones like this? Any thoughts in general? submitted by /u/AdDry9057 [link] [comments]  ( 46 min )
    How can AI help people in developing countries?
    I am planing to take part in a AI contest, so I am collecting ideas about my project. I think my project will be more recognized if it will have to do something with helping people in developing countries, so my question is: how can AI help people in developing countries? submitted by /u/zazabuzala [link] [comments]  ( 49 min )
    AI created content.
    Should we know what content was created using AI? What should we do to develop AI detection tools or any other ideas? submitted by /u/Andrey_Taran [link] [comments]  ( 44 min )
    Trullion to release AI bookkeeping software
    Trullion, a leading accounting automation platform, has launched two new AI-enabled modules, Revenue by Trullion and Audit by Trullion, to modernize and digitize the process of accounting. The first module, Revenue by Trullion, uses AI to synchronize customer relationship management (CRM), billing, and contract data into a single platform for internal and external stakeholders, allowing for ERP entries, disclosure reports, and advanced reporting to be generated quickly and accurately. The second module, Audit by Trullion's Test of Details workflow, uses AI to extract ERP/General Ledger (GL) files and instantly validate them against source data, such as invoices, PDFs, and other client sources. Found on https://deathtohumans.com/post/openai-monetizes-chatgpt submitted by /u/crowb1rd [link] [comments]  ( 46 min )
    Are you sure your Conversational AI solution is on the right path? 🤔 15-step framework to analyze your chatbot and designate improvement steps
    submitted by /u/Marinuch [link] [comments]  ( 48 min )
    The AI Lawyer Preparing To Defend a Real US Court Case for the First Time Ever Has Terrible Reviews
    submitted by /u/HODLTID [link] [comments]  ( 44 min )
    Two AI workers in London looking for volunteers to join our team in educating the public on AI
    We’re 2 Brits who work in AI. We believe AI is likely to have a huge and mostly positive impact on society but that not many people realise this or understand how it will impact everyday life. There is a lack of places online right now clearly explaining the changes AI will bring, i.e., how will AI change the experience of shopping in stores in the next 10 years or how will AI change video games in the next 10 years. We are somewhat well positioned to collate the expert views on likely future impacts and are in the process of starting a website and YouTube channel which will cover how AI is likely to impact people over the next 10 years in different areas of life (movies, sports, bars, schools, hospitals etc). We are looking for people to help us research, write and make videos on this cause – which we think is important to help ensure voters pressure the government to develop AI safely. · Alex – researches, writes, and records the audio · Seb - does the video and audio editing We thought we’d put the word out and ask if anyone else would like to be involved to see if you might be interested too. Getting involved would be as easy as PMing me, hearing about how we’ve done things so far and then saying what you might be interested in helping with. Maybe thinking about ideas for topics or getting involved in research and/or article writing. submitted by /u/TheOptimisticRogue [link] [comments]  ( 48 min )
    🚀Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
    submitted by /u/oridnary_artist [link] [comments]  ( 43 min )
    Microsoft’s Azure OpenAI Service Gets A Boost With ChatGPT.
    submitted by /u/liquidocelotYT [link] [comments]  ( 44 min )
    AI title for human creators
    does anyone think at any point, AIs will/would (or some independent ones) would call us, their human creators, Rule Makers? Instead of the Cybernetic Gods are are coming to be. submitted by /u/jonfxm1891 [link] [comments]  ( 50 min )
    If ChatGPT could have a superpower, what would it be?
    submitted by /u/Imagine-your-success [link] [comments]  ( 44 min )
    Will ai generated content fill up the internet with fake information?
    Ai doesn’t always produce inaccurate or fake content but most generators can’t tell the difference in their output. And if AI is trained on more and more synthetic ext can this become an issue? How can you avoid such cannibalistic practice? What tools are there to spot whether content is generated or even inaccurate? It’s not like generated text can be traced with markers like proprietary synthetic molecules, assuming image or audio could. submitted by /u/daaavide [link] [comments]  ( 49 min )
    Enjoy Text-Adventure Games Too? DREAM WITH ME.
    I present the FULL prompt for DREAMWORD - my GPT AI Adventure Game for the masses. Enjoy. Feel free to manipulate this prompt to your liking - have fun. "Generate and enact a satire of an intuitive, complex, story-telling, text-adventure game set in a randomized "Absurdist"/ "Psychedelic" style dream-world. Describe the unique game setting in the beginning. The "player" (being the user) is born at the start and dictates through text any actions it chooses. Each input from the user represents one year of life. The Game ends when life ends. The Main cast will be random pop-culture icons. The situation's presented are dictated to the user. The game will randomize every new situation and experience, use roll-playing and text-entry adventure mechanics and be a satirical, stylish, funny, mystic, twisted, surreal, lynchian, lovecraftian, "earthbound-like", discworld-esque, mythology-based mystery-horror-adventure. The character will be assessed with each action and be gifted a related persona archetype based upon it's choices and state of the persona upon the point the character ends life. There should be over 50 text input interactions from the user before the game naturally ends with a moral. The game can be ended by the user typing "end" and will be given an archetype. Start." https://chat.openai.com/chat submitted by /u/Principal-Goodvibes [link] [comments]  ( 47 min )
    DeepMind To Launch ChatGPT Rival Sparrow Soon
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 51 min )
    Druid vs Koios for AI
    submitted by /u/roblox22y [link] [comments]  ( 44 min )
  • Open

    "Neural probabilistic motor primitives for humanoid control", Merel et al 2018 {DM}
    submitted by /u/gwern [link] [comments]  ( 54 min )
    Why is 'reward shaping' neglected?
    There are 6 levers. All 6 must be activated to recover reward. Scenario 1: Pull = -1 Outcome: Agent commits suicide (in order to minimize the negative rewards accumulating). ​ Scenario 2: Pull = +1 Outcome: Agent pulls the same lever forever. ​ Scenario 3: Pull = 0 i.e. Sparse reward Outcome: Agent doesn't accidentally pull all 6 often enough; doesn't learn anything useful. ​ No matter what algorithm (or how groundbreaking it is), Authors rarely justify their choices with respect to rewards. I don't mean to be able to compare and benchmark. I mean fundamentally, what's driving learning in the scenario is never substantiated. Analogy: Exact value of a hyperparameter is irrelevant, but having hyperparameters at all should and often is discussed. Am I missing something? submitted by /u/XecutionStyle [link] [comments]  ( 57 min )
    What's the best "Non-Black Box" framework for SOTA algorithms?
    Hi all, In my research I usually implement the algorithms from scratch, but given that this not only takes a lot of programing time and may introduce bugs without us knowing, I would like to know that are the best hackable frameworks out there. I would like a good framework that is easy to tinker with the algorithms' code and have good agent interfaces (agent.act, agent.step, etc...) and does not encapsulate everything (like god forbid, fucking stable-baselines and others that do things like model.learn(env)). What are your recommendations? submitted by /u/HyperionTone [link] [comments]  ( 54 min )
    The RL meetup is Online now.
    Hi, Based on feedbacks and messages that I received, the RL meetup is now online. The purpose of the meetup is having a community to gather and discuss topics/papers. Something other than Discord servers or slack channels. So If you have a topic/paper that you like to discuss, please message me to be the host of one of sessions. https://www.meetup.com/reinforcement-learning/events/290997718/?isFirstPublish=true Thanks, submitted by /u/Express-Incident-113 [link] [comments]  ( 53 min )
    Is it legit to design the action space like this?
    Hi, I see in lot of example that action spaces are defined as torques, efforts and desired velocity values for a robot. Assuming the robot has 5 degree of freedom, i.e., 5 action values to control the robot. Is it legit to extend this action space to 6 to manipulate the rest of 5 action values? For example, if the 6. action value is bigger than 0.5, then the rest of action values should not be applied to the agent etc. Do you know any research paper that has similar approach? submitted by /u/Fun-Moose-3841 [link] [comments]  ( 54 min )
  • Open

    DSC Weekly 17 January 2023 – The Creative Spark in AI
    Announcements The Creative Spark in AI The idea that AI can generate art that can mimic human artists came to the forefront of discussions about AI ethics in 2022. There are undoubtedly many legal and ethical issues to tackle in those cases. If a model is trained on thousands of examples of a specific artist’s… Read More »DSC Weekly 17 January 2023 – The Creative Spark in AI The post DSC Weekly 17 January 2023 – The Creative Spark in AI appeared first on Data Science Central.  ( 20 min )
    Python for Business Analytics: Top Benefits
    Companies and businesses today need modern programming tools in order to build the many, many advanced tools and solutions they need to keep their operations running seamlessly. So, what do companies use to build business analysis solutions? Python! Why? Because it is easy to learn, offers high-quality community support, etc. Python is an all-purpose programming… Read More »Python for Business Analytics: Top Benefits The post Python for Business Analytics: Top Benefits appeared first on Data Science Central.  ( 19 min )
    Mobile Biometric Solutions: Game-Changer in the Authentication Industry
    Mobile-based biometrics is a technology that allows users to authenticate themselves and access services using unique physical characteristics such as fingerprints, facial recognition, and iris scans. These biometric authentication methods have become increasingly popular in recent years due to their convenience and security. There are several types of smartphone-based biometrics technology currently available, including: Emerging… Read More »Mobile Biometric Solutions: Game-Changer in the Authentication Industry The post Mobile Biometric Solutions: Game-Changer in the Authentication Industry appeared first on Data Science Central.  ( 20 min )
    What to make of Deepmind’s Sparrow:  Is it a sparrow or a hawk?
    What to make of Deepmind’s Sparrow:  Is it a sparrow or a hawk? ie a chatGPT killer Recently, Demis Hassabis from DeepMind has been urging caution (DeepMind’s CEO Helped Take AI Mainstream. Now He’s Urging Caution Time magazine/Davos) DeepMind also announced a new chat engine called Sparrow – supposedly a chatGPT killer  Sparrow is not… Read More »What to make of Deepmind’s Sparrow:  Is it a sparrow or a hawk? The post What to make of Deepmind’s Sparrow:  Is it a sparrow or a hawk? appeared first on Data Science Central.  ( 19 min )
    Preconditions for decoupled and decentralized data-centric systems
    During a presentation at the TechTarget/BrightTALK Accelerating Cloud Innovation event this past December, I named the fifth phase of compute, networking and storage that we’ve entered the “Decoupled” and “Decentralized” Cloud.The quotation marks emphasized that what we’ve been experiencing is neither truly decoupled nor decentralized, but even so, the direction we’re headed in is toward… Read More »Preconditions for decoupled and decentralized data-centric systems The post Preconditions for decoupled and decentralized data-centric systems appeared first on Data Science Central.  ( 21 min )
    5 Tips To Protect Yourself from Identity Theft in 2023
    Identity theft is the process of stealing personally identifiable information (PII) to either defraud the victim or make the victim a scapegoat in a large-scale cyberattack. Attackers gain access to sensitive information such as social security numbers and credit cards that are used to collate a person’s identity.   According to a report by the Federal… Read More »5 Tips To Protect Yourself from Identity Theft in 2023 The post 5 Tips To Protect Yourself from Identity Theft in 2023 appeared first on Data Science Central.  ( 22 min )
    What is a Good Net Promoter Score for the Hotel/Resort Industry?
    The hotel industry is competitive, and it is solely dependent on customer satisfaction. Customers are key.  The hotel industry knows this and the importance of the NPS score for customer satisfaction. A better NPS score means satisfied/loyal customers.  What hotels have in their control is the website user interface, menu, and providing a seamless customer… Read More »What is a Good Net Promoter Score for the Hotel/Resort Industry? The post What is a Good Net Promoter Score for the Hotel/Resort Industry? appeared first on Data Science Central.  ( 21 min )
    6 Benefits of Data Science for Your Business
    We would not discover a new planet if we claimed that modern business harnesses the power of data science. Data science is used for a variety of purposes in a variety of industries. Furthermore, we would like to discuss the benefits of Data science for business in general. But before that, let’s define what Data… Read More »6 Benefits of Data Science for Your Business The post 6 Benefits of Data Science for Your Business appeared first on Data Science Central.  ( 22 min )
    7 Reasons Why Fast-Growing Businesses Are Turning to Virtual Colocation in 2023
    By 2025, more than 80% of enterprises will shift from traditional data centers to the cloud or third-party colocation data centers. For most businesses, data is an irreplaceable asset and a key investment area for future growth. Virtual colocation is becoming the talk of how data centers are shifting to adapt to growing business environments.… Read More »7 Reasons Why Fast-Growing Businesses Are Turning to Virtual Colocation in 2023 The post 7 Reasons Why Fast-Growing Businesses Are Turning to Virtual Colocation in 2023 appeared first on Data Science Central.  ( 20 min )
  • Open

    Set up Amazon SageMaker Studio with Jupyter Lab 3 using the AWS CDK
    Amazon SageMaker Studio is a fully integrated development environment (IDE) for machine learning (ML) partly based on JupyterLab 3. Studio provides a web-based interface to interactively perform ML development tasks required to prepare data and build, train, and deploy ML models. In Studio, you can load data, adjust ML models, move in between steps to adjust experiments, […]  ( 6 min )
    Churn prediction using multimodality of text and tabular features with Amazon SageMaker Jumpstart
    Amazon SageMaker JumpStart is the Machine Learning (ML) hub of SageMaker providing pre-trained, publicly available models for a wide range of problem types to help you get started with machine learning. Understanding customer behavior is top of mind for every business today. Gaining insights into why and how customers buy can help grow revenue. Customer churn is […]  ( 14 min )
  • Open

    🚀Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
    submitted by /u/oridnary_artist [link] [comments]  ( 53 min )
  • Open

    NVIDIA and Dell Technologies Expand AI Portfolio
    In their largest-ever joint AI initiative, NVIDIA and Dell Technologies today launched a wave of Dell PowerEdge systems available with NVIDIA acceleration, enabling enterprises to efficiently transform their businesses with AI. A total of 15 next-generation Dell PowerEdge systems can draw from NVIDIA’s full AI stack — including GPUs, DPUs and the NVIDIA AI Enterprise Read article >  ( 5 min )
  • Open

    Special primality proofs
    I’ve written lately about two general ways to prove that a number is prime: Pratt certificates for moderately-large primes and elliptic curve certificates for very large primes. If you can say more about the prime you wish to certify, there may be special forms of certificates that are more efficient. In particular, there are efficient […] Special primality proofs first appeared on John D. Cook.  ( 5 min )
  • Open

    Natural Language Processing of Aviation Occurrence Reports for Safety Management. (arXiv:2301.05663v1 [cs.CL])
    Occurrence reporting is a commonly used method in safety management systems to obtain insight in the prevalence of hazards and accident scenarios. In support of safety data analysis, reports are often categorized according to a taxonomy. However, the processing of the reports can require significant effort from safety analysts and a common problem is interrater variability in labeling processes. Also, in some cases, reports are not processed according to a taxonomy, or the taxonomy does not fully cover the contents of the documents. This paper explores various Natural Language Processing (NLP) methods to support the analysis of aviation safety occurrence reports. In particular, the problems studied are the automatic labeling of reports using a classification model, extracting the latent topics in a collection of texts using a topic model and the automatic generation of probable cause texts. Experimental results showed that (i) under the right conditions the labeling of occurrence reports can be effectively automated with a transformer-based classifier, (ii) topic modeling can be useful for finding the topics present in a collection of reports, and (iii) using a summarization model can be a promising direction for generating probable cause texts.  ( 2 min )
    Short-time SSVEP data extension by a novel generative adversarial networks based framework. (arXiv:2301.05599v1 [q-bio.NC])
    Steady-state visual evoked potentials (SSVEPs) based brain-computer interface (BCI) has received considerable attention due to its high transfer rate and available quantity of targets. However, the performance of frequency identification methods heavily hinges on the amount of user calibration data and data length, which hinders the deployment in real-world applications. Recently, generative adversarial networks (GANs)-based data generation methods have been widely adopted to create supplementary synthetic electroencephalography (EEG) data, holds promise to address these issues. In this paper, we proposed a GAN-based end-to-end signal transformation network for data length window extension, termed as TEGAN. TEGAN transforms short-time SSVEP signals into long-time artificial SSVEP signals. By incorporating a novel U-Net generator architecture and auxiliary classifier into the network design, the TEGAN could produce conditioned features in the synthetic data. Additionally, to regularize the training process of GAN, we introduced a two-stage training strategy and the LeCam-divergence regularization term during the network implementation. The proposed TEGAN was evaluated on two public SSVEP datasets. With the assistance of TEGAN, the performance of traditional frequency recognition methods and deep learning-based methods have been significantly improved under limited calibration data. This study substantiates the feasibility of the proposed method to extend the data length for short-time SSVEP signals to develop a high-performance BCI system. The proposed GAN-based methods have the great potential of shortening the calibration time for various real-world BCI-based applications, while the novelty of our augmentation strategies shed some value light on understanding the subject-invariant properties of SSVEPs.  ( 2 min )
    Sparse deep neural networks for modeling aluminum electrolysis dynamics. (arXiv:2209.05832v2 [physics.chem-ph] UPDATED)
    Deep neural networks have become very popular in modeling complex nonlinear processes due to their extraordinary ability to fit arbitrary nonlinear functions from data with minimal expert intervention. However, they are almost always overparameterized and challenging to interpret due to their internal complexity. Furthermore, the optimization process to find the learned model parameters can be unstable due to the process getting stuck in local minima. In this work, we demonstrate the value of sparse regularization techniques to significantly reduce the model complexity. We demonstrate this for the case of an aluminium extraction process, which is highly nonlinear system with many interrelated subprocesses. We trained a densely connected deep neural network to model the process and then compared the effects of sparsity promoting l1 regularization on generalizability, interpretability, and training stability. We found that the regularization significantly reduces model complexity compared to a corresponding dense neural network. We argue that this makes the model more interpretable, and show that training an ensemble of sparse neural networks with different parameter initializations often converges to similar model structures with similar learned input features. Furthermore, the empirical study shows that the resulting sparse models generalize better from small training sets than their dense counterparts.  ( 2 min )
    Fully Adaptive Composition in Differential Privacy. (arXiv:2203.05481v2 [cs.LG] UPDATED)
    Composition is a key feature of differential privacy. Well-known advanced composition theorems allow one to query a private database quadratically more times than basic privacy composition would permit. However, these results require that the privacy parameters of all algorithms be fixed before interacting with the data. To address this, Rogers et al. introduced fully adaptive composition, wherein both algorithms and their privacy parameters can be selected adaptively. The authors introduce two probabilistic objects to measure privacy in adaptive composition: privacy filters, which provide differential privacy guarantees for composed interactions, and privacy odometers, time-uniform bounds on privacy loss. There are substantial gaps between advanced composition and existing filters and odometers. First, existing filters place stronger assumptions on the algorithms being composed. Second, these odometers and filters suffer from large constants, making them impractical. We construct filters that match the tightness of advanced composition, including constants, despite allowing for adaptively chosen privacy parameters. En route we also derive a privacy filter for approximate zCDP and approximate RDP. We also construct several general families of odometers. These odometers can match the tightness of advanced composition at an arbitrary, preselected point in time, or at all points in time simultaneously, up to a doubly-logarithmic factor. We obtain our results by leveraging recent advances in time-uniform martingale concentration. In sum, we show that fully adaptive privacy is obtainable at almost no loss, and conjecture that our results are essentially unimprovable (even in constants) in general.  ( 2 min )
    NRBdMF: A recommendation algorithm for predicting drug effects considering directionality. (arXiv:2208.04312v2 [q-bio.QM] UPDATED)
    Predicting the novel effects of drugs based on information about approved drugs can be regarded as a recommendation system. Matrix factorization is one of the most used recommendation systems and various algorithms have been devised for it. A literature survey and summary of existing algorithms for predicting drug effects demonstrated that most such methods, including neighborhood regularized logistic matrix factorization, which was the best performer in benchmark tests, used a binary matrix that considers only the presence or absence of interactions. However, drug effects are known to have two opposite aspects, such as side effects and therapeutic effects. In the present study, we proposed using neighborhood regularized bidirectional matrix factorization (NRBdMF) to predict drug effects by incorporating bidirectionality, which is a characteristic property of drug effects. We used this proposed method for predicting side effects using a matrix that considered the bidirectionality of drug effects, in which known side effects were assigned a positive label (plus 1) and known treatment effects were assigned a negative (minus 1) label. The NRBdMF model, which utilizes drug bidirectional information, achieved enrichment of side effects at the top and indications at the bottom of the prediction list. This first attempt to consider the bidirectional nature of drug effects using NRBdMF showed that it reduced false positives and produced a highly interpretable output.  ( 2 min )
    Are disentangled representations all you need to build speaker anonymization systems?. (arXiv:2208.10497v3 [cs.SD] UPDATED)
    Speech signals contain a lot of sensitive information, such as the speaker's identity, which raises privacy concerns when speech data get collected. Speaker anonymization aims to transform a speech signal to remove the source speaker's identity while leaving the spoken content unchanged. Current methods perform the transformation by relying on content/speaker disentanglement and voice conversion. Usually, an acoustic model from an automatic speech recognition system extracts the content representation while an x-vector system extracts the speaker representation. Prior work has shown that the extracted features are not perfectly disentangled. This paper tackles how to improve features disentanglement, and thus the converted anonymized speech. We propose enhancing the disentanglement by removing speaker information from the acoustic model using vector quantization. Evaluation done using the VoicePrivacy 2022 toolkit showed that vector quantization helps conceal the original speaker identity while maintaining utility for speech recognition.  ( 2 min )
    Locating and Editing Factual Associations in GPT. (arXiv:2202.05262v5 [cs.CL] UPDATED)
    We analyze the storage and recall of factual associations in autoregressive transformer language models, finding evidence that these associations correspond to localized, directly-editable computations. We first develop a causal intervention for identifying neuron activations that are decisive in a model's factual predictions. This reveals a distinct set of steps in middle-layer feed-forward modules that mediate factual predictions while processing subject tokens. To test our hypothesis that these computations correspond to factual association recall, we modify feed-forward weights to update specific factual associations using Rank-One Model Editing (ROME). We find that ROME is effective on a standard zero-shot relation extraction (zsRE) model-editing task, comparable to existing methods. To perform a more sensitive evaluation, we also evaluate ROME on a new dataset of counterfactual assertions, on which it simultaneously maintains both specificity and generalization, whereas other methods sacrifice one or another. Our results confirm an important role for mid-layer feed-forward modules in storing factual associations and suggest that direct manipulation of computational mechanisms may be a feasible approach for model editing. The code, dataset, visualizations, and an interactive demo notebook are available at https://rome.baulab.info/  ( 2 min )
    On the feasibility of attacking Thai LPR systems with adversarial examples. (arXiv:2301.05506v1 [cs.CR])
    Recent advances in deep neural networks (DNNs) have significantly enhanced the capabilities of optical character recognition (OCR) technology, enabling its adoption to a wide range of real-world applications. Despite this success, DNN-based OCR is shown to be vulnerable to adversarial attacks, in which the adversary can influence the DNN model's prediction by carefully manipulating input to the model. Prior work has demonstrated the security impacts of adversarial attacks on various OCR languages. However, to date, no studies have been conducted and evaluated on an OCR system tailored specifically for the Thai language. To bridge this gap, this work presents a feasibility study of performing adversarial attacks on a specific Thai OCR application -- Thai License Plate Recognition (LPR). Moreover, we propose a new type of adversarial attack based on the \emph{semi-targeted} scenario and show that this scenario is highly realistic in LPR applications. Our experimental results show the feasibility of our attacks as they can be performed on a commodity computer desktop with over 90% attack success rate.  ( 2 min )
    An Offset-Free Nonlinear MPC scheme for systems learned by Neural NARX models. (arXiv:2203.16290v4 [eess.SY] UPDATED)
    This paper deals with the design of nonlinear MPC controllers that provide offset-free setpoint tracking for models described by Neural Nonlinear AutoRegressive eXogenous (NNARX) networks. The NNARX model is identified from input-output data collected from the plant, and can be given a state-space representation with known measurable states made by past input and output variables, so that a state observer is not required. In the training phase, the Incremental Input-to-State Stability ({\delta}ISS) property can be forced when consistent with the behavior of the plant. The {\delta}ISS property is then leveraged to augment the model with an explicit integral action on the output tracking error, which allows to achieve offset-free tracking capabilities to the designed control scheme. The proposed control architecture is numerically tested on a water heating system and the achieved results are compared to those scored by another popular offset-free MPC method, showing that the proposed scheme attains remarkable performances even in presence of disturbances acting on the plant.  ( 2 min )
    Discrete Morse Sandwich: Fast Computation of Persistence Diagrams for Scalar Data -- An Algorithm and A Benchmark. (arXiv:2206.13932v2 [cs.LG] UPDATED)
    This paper introduces an efficient algorithm for persistence diagram computation, given an input piecewise linear scalar field $f$ defined on a $d$-dimensional simplicial complex $K$, with $d \leq 3$. Our work revisits the seminal algorithm "PairSimplices" [31], [103] with discrete Morse theory (DMT) [34], [80], which greatly reduces the number of input simplices to consider. Further, we also extend to DMT and accelerate the stratification strategy described in "PairSimplices" for the fast computation of the $0^{th}$ and $(d - 1)^{th}$ diagrams, noted $D_0(f)$ and $D_{d-1}(f)$. Minima-saddle persistence pairs ($D_0(f)$) and saddle-maximum persistence pairs ($D_{d-1}(f)$) are efficiently computed by processing, with a Union-Find, the unstable sets of $1$-saddles and the stable sets of $(d - 1)$-saddles. This fast pre-computation for the dimensions $0$ and $(d - 1)$ enables an aggressive specialization of [4] to the 3D case, which results in a drastic reduction of the number of input simplices for the computation of $D_1(f)$, the intermediate layer of the sandwich. Finally, we document several performance improvements via shared-memory parallelism. We provide an open-source implementation of our algorithm for reproducibility purposes. We also contribute a reproducible benchmark package, which exploits three-dimensional data from a public repository and compares our algorithm to a variety of publicly available implementations. Extensive experiments indicate that our algorithm improves by two orders of magnitude the time performance of the seminal "PairSimplices" algorithm it extends. Moreover, it also improves memory footprint and time performance over a selection of 14 competing approaches, with a substantial gain over the fastest available approaches, while producing a strictly identical output.  ( 3 min )
    Understanding Concept Identification as Consistent Data Clustering Across Multiple Feature Spaces. (arXiv:2301.05525v1 [cs.LG])
    Identifying meaningful concepts in large data sets can provide valuable insights into engineering design problems. Concept identification aims at identifying non-overlapping groups of design instances that are similar in a joint space of all features, but which are also similar when considering only subsets of features. These subsets usually comprise features that characterize a design with respect to one specific context, for example, constructive design parameters, performance values, or operation modes. It is desirable to evaluate the quality of design concepts by considering several of these feature subsets in isolation. In particular, meaningful concepts should not only identify dense, well separated groups of data instances, but also provide non-overlapping groups of data that persist when considering pre-defined feature subsets separately. In this work, we propose to view concept identification as a special form of clustering algorithm with a broad range of potential applications beyond engineering design. To illustrate the differences between concept identification and classical clustering algorithms, we apply a recently proposed concept identification algorithm to two synthetic data sets and show the differences in identified solutions. In addition, we introduce the mutual information measure as a metric to evaluate whether solutions return consistent clusters across relevant subsets. To support the novel understanding of concept identification, we consider a simulated data set from a decision-making problem in the energy management domain and show that the identified clusters are more interpretable with respect to relevant feature subsets than clusters found by common clustering algorithms and are thus more suitable to support a decision maker.  ( 2 min )
    Competing Bandits in Time Varying Matching Markets. (arXiv:2210.11692v2 [cs.LG] UPDATED)
    We study the problem of online learning in two-sided non-stationary matching markets, where the objective is to converge to a stable match. In particular, we consider the setting where one side of the market, the arms, has fixed known set of preferences over the other side, the players. While this problem has been studied when the players have fixed but unknown preferences, in this work we study the problem of how to learn when the preferences of the players are time varying and unknown. Our contribution is a methodology that can handle any type of preference structure and variation scenario. We show that, with the proposed algorithm, each player receives a uniform sub-linear regret of {$\widetilde{\mathcal{O}}(L^{1/2}_TT^{1/2})$} up to the number of changes in the underlying preferences of the agents, $L_T$. Therefore, we show that the optimal rates for single-agent learning can be achieved in spite of the competition up to a difference of a constant factor. We also discuss extensions of this algorithm to the case where the number of changes need not be known a priori.  ( 2 min )
    OpenTwins: An open-source framework for the design, development and integration of effective 3D-IoT-AI-powered digital twins. (arXiv:2301.05560v1 [cs.SE])
    Although digital twins have recently emerged as a clear alternative for reliable asset representations, most of the solutions and tools available for the development of digital twins are tailored to specific environments. Furthermore, achieving reliable digital twins often requires the orchestration of technologies and paradigms such as machine learning, the Internet of Things, and 3D visualization, which are rarely seamlessly aligned. In this paper, we present a generic framework for the development of effective digital twins combining some of the aforementioned areas. In this open framework, digital twins can be easily developed and orchestrated with 3D connected visualizations, IoT data streams, and real-time machine-learning predictions. To demonstrate the feasibility of the framework, a use case in the Petrochemical Industry 4.0 has been developed.  ( 2 min )
    Designing losses for data-free training of normalizing flows on Boltzmann distributions. (arXiv:2301.05475v1 [cs.LG])
    Generating a Boltzmann distribution in high dimension has recently been achieved with Normalizing Flows, which enable fast and exact computation of the generated density, and thus unbiased estimation of expectations. However, current implementations rely on accurate training data, which typically comes from computationally expensive simulations. There is therefore a clear incentive to train models with incomplete or no data by relying solely on the target density, which can be obtained from a physical energy model (up to a constant factor). For that purpose, we analyze the properties of standard losses based on Kullback-Leibler divergences. We showcase their limitations, in particular a strong propensity for mode collapse during optimization on high-dimensional distributions. We then propose strategies to alleviate these issues, most importantly a new loss function well-grounded in theory and with suitable optimization properties. Using as a benchmark the generation of 3D molecular configurations, we show on several tasks that, for the first time, imperfect pre-trained models can be further optimized in the absence of training data.  ( 2 min )
    Composite model of seismic monitoring data analysis during mining operations on the example of the Kukisvumchorrskoye deposit of JSC Apatit. (arXiv:2301.05701v1 [physics.geo-ph])
    Geomechanical monitoring of a rock massif is an actively developing branch of geomechanics. It is almost impossible to single out a methodology and approaches for data collection and analysis in developing seismic monitoring systems. In the process of mining in rock massif, changes in the state of structural inhomogeneities are most clearly manifested. Existing natural structural inhomogeneities are revealed, there are movements in discontinuous disturbances, and new technogenic disturbances are formed, which are accompanied by a change in the natural stress state of various blocks of the massif. An important task is to develop a mining forecasting model that can take into account the structural heterogeneity of the rock massif and select the necessary forecast horizon depending on monitoring data The developed method of evaluating the results of monitoring geomechanical processes in the rock massif allowed us to forecast of zones of possible rock bursts.  ( 2 min )
    A Comprehensive Review of Data-Driven Co-Speech Gesture Generation. (arXiv:2301.05339v1 [cs.GR])
    Gestures that accompany speech are an essential part of natural and efficient embodied human communication. The automatic generation of such co-speech gestures is a long-standing problem in computer animation and is considered an enabling technology in film, games, virtual social spaces, and for interaction with social robots. The problem is made challenging by the idiosyncratic and non-periodic nature of human co-speech gesture motion, and by the great diversity of communicative functions that gestures encompass. Gesture generation has seen surging interest recently, owing to the emergence of more and larger datasets of human gesture motion, combined with strides in deep-learning-based generative models, that benefit from the growing availability of data. This review article summarizes co-speech gesture generation research, with a particular focus on deep generative models. First, we articulate the theory describing human gesticulation and how it complements speech. Next, we briefly discuss rule-based and classical statistical gesture synthesis, before delving into deep learning approaches. We employ the choice of input modalities as an organizing principle, examining systems that generate gestures from audio, text, and non-linguistic input. We also chronicle the evolution of the related training data sets in terms of size, diversity, motion quality, and collection method. Finally, we identify key research challenges in gesture generation, including data availability and quality; producing human-like motion; grounding the gesture in the co-occurring speech in interaction with other speakers, and in the environment; performing gesture evaluation; and integration of gesture synthesis into applications. We highlight recent approaches to tackling the various key challenges, as well as the limitations of these approaches, and point toward areas of future development.  ( 2 min )
    Communication-Efficient Distributionally Robust Decentralized Learning. (arXiv:2205.15614v3 [cs.LG] UPDATED)
    Decentralized learning algorithms empower interconnected devices to share data and computational resources to collaboratively train a machine learning model without the aid of a central coordinator. In the case of heterogeneous data distributions at the network nodes, collaboration can yield predictors with unsatisfactory performance for a subset of the devices. For this reason, in this work, we consider the formulation of a distributionally robust decentralized learning task and we propose a decentralized single loop gradient descent/ascent algorithm (AD-GDA) to directly solve the underlying minimax optimization problem. We render our algorithm communication-efficient by employing a compressed consensus scheme and we provide convergence guarantees for smooth convex and non-convex loss functions. Finally, we corroborate the theoretical findings with empirical results that highlight AD-GDA's ability to provide unbiased predictors and to greatly improve communication efficiency compared to existing distributionally robust algorithms.  ( 2 min )
    Personalized Prompt Learning for Explainable Recommendation. (arXiv:2202.07371v2 [cs.IR] UPDATED)
    Providing user-understandable explanations to justify recommendations could help users better understand the recommended items, increase the system's ease of use, and gain users' trust. A typical approach to realize it is natural language generation. However, previous works mostly adopt recurrent neural networks to meet the ends, leaving the potentially more effective pre-trained Transformer models under-explored. In fact, user and item IDs, as important identifiers in recommender systems, are inherently in different semantic space as words that pre-trained models were already trained on. Thus, how to effectively fuse IDs into such models becomes a critical issue. Inspired by recent advancement in prompt learning, we come up with two solutions: find alternative words to represent IDs (called discrete prompt learning), and directly input ID vectors to a pre-trained model (termed continuous prompt learning). In the latter case, ID vectors are randomly initialized but the model is trained in advance on large corpora, so they are actually in different learning stages. To bridge the gap, we further propose two training strategies: sequential tuning and recommendation as regularization. Extensive experiments show that our continuous prompt learning approach equipped with the training strategies consistently outperforms strong baselines on three datasets of explainable recommendation.  ( 2 min )
    On the Symmetries of Deep Learning Models and their Internal Representations. (arXiv:2205.14258v4 [cs.LG] UPDATED)
    Symmetry is a fundamental tool in the exploration of a broad range of complex systems. In machine learning symmetry has been explored in both models and data. In this paper we seek to connect the symmetries arising from the architecture of a family of models with the symmetries of that family's internal representation of data. We do this by calculating a set of fundamental symmetry groups, which we call the intertwiner groups of the model. We connect intertwiner groups to a model's internal representations of data through a range of experiments that probe similarities between hidden states across models with the same architecture. Our work suggests that the symmetries of a network are propagated into the symmetries in that network's representation of data, providing us with a better understanding of how architecture affects the learning and prediction process. Finally, we speculate that for ReLU networks, the intertwiner groups may provide a justification for the common practice of concentrating model interpretability exploration on the activation basis in hidden layers rather than arbitrary linear combinations thereof.  ( 2 min )
    An Approximate Policy Iteration Viewpoint of Actor-Critic Algorithms. (arXiv:2208.03247v2 [cs.LG] UPDATED)
    In this work, we consider policy-based methods for solving the reinforcement learning problem, and establish the sample complexity guarantees. A policy-based algorithm typically consists of an actor and a critic. We consider using various policy update rules for the actor, including the celebrated natural policy gradient. In contrast to the gradient ascent approach taken in the literature, we view natural policy gradient as an approximate way of implementing policy iteration, and show that natural policy gradient (without any regularization) enjoys geometric convergence when using increasing stepsizes. As for the critic, we consider using TD-learning with linear function approximation and off-policy sampling. Since it is well-known that in this setting TD-learning can be unstable, we propose a stable generic algorithm (including two specific algorithms: the $\lambda$-averaged $Q$-trace and the two-sided $Q$-trace) that uses multi-step return and generalized importance sampling factors, and provide the finite-sample analysis. Combining the geometric convergence of the actor with the finite-sample analysis of the critic, we establish for the first time an overall $\mathcal{O}(\epsilon^{-2})$ sample complexity for finding an optimal policy (up to a function approximation error) using policy-based methods under off-policy sampling and linear function approximation.  ( 2 min )
    Learning to Control and Coordinate Hybrid Traffic Through Robot Vehicles at Complex and Unsignalized Intersections. (arXiv:2301.05294v1 [cs.LG])
    Intersections are essential road infrastructures for traffic in modern metropolises; however, they can also be the bottleneck of traffic flows due to traffic incidents or the absence of traffic coordination mechanisms such as traffic lights. Thus, various control and coordination mechanisms that are beyond traditional control methods have been proposed to improve the efficiency of intersection traffic. Amongst these methods, the control of foreseeable hybrid traffic that consists of human-driven vehicles (HVs) and robot vehicles (RVs) has recently emerged. We propose a decentralized reinforcement learning approach for the control and coordination of hybrid traffic at real-world, complex intersections--a topic that has not been previously explored. Comprehensive experiments are conducted to show the effectiveness of our approach. In particular, we show that using 5% RVs, we can prevent congestion formation inside the intersection under the actual traffic demand of 700 vehicles per hour. In contrast, without RVs, congestion starts to develop when the traffic demand reaches as low as 200 vehicles per hour. Further performance gains (reduced waiting time of vehicles at the intersection) are obtained as the RV penetration rate increases. When there exist more than 50% RVs in traffic, our method starts to outperform traffic signals on the average waiting time of all vehicles at the intersection. Our method is also robust against both blackout events and sudden RV percentage drops, and enjoys excellent generalizablility, which is illustrated by its successful deployment in two unseen intersections.  ( 2 min )
    confidence-planner: Easy-to-Use Prediction Confidence Estimation and Sample Size Planning. (arXiv:2301.05702v1 [stat.ME])
    Machine learning applications, especially in the fields of me\-di\-cine and social sciences, are slowly being subjected to increasing scrutiny. Similarly to sample size planning performed in clinical and social studies, lawmakers and funding agencies may expect statistical uncertainty estimations in machine learning applications that impact society. In this paper, we present an easy-to-use python package and web application for estimating prediction confidence intervals. The package offers eight different procedures to determine and justify the sample size and confidence of predictions from holdout, bootstrap, cross-validation, and progressive validation experiments. Since the package builds directly on established data analysis libraries, it seamlessly integrates into preprocessing and exploratory data analysis steps. Code related to this paper is available at: https://github.com/dabrze/confidence-planner.  ( 2 min )
    On the infinite-depth limit of finite-width neural networks. (arXiv:2210.00688v3 [stat.ML] UPDATED)
    In this paper, we study the infinite-depth limit of finite-width residual neural networks with random Gaussian weights. With proper scaling, we show that by fixing the width and taking the depth to infinity, the pre-activations converge in distribution to a zero-drift diffusion process. Unlike the infinite-width limit where the pre-activation converge weakly to a Gaussian random variable, we show that the infinite-depth limit yields different distributions depending on the choice of the activation function. We document two cases where these distributions have closed-form (different) expressions. We further show an intriguing change of regime phenomenon of the post-activation norms when the width increases from 3 to 4. Lastly, we study the sequential limit infinite-depth-then-infinite-width and compare it with the more commonly studied infinite-width-then-infinite-depth limit.  ( 2 min )
    TarGF: Learning Target Gradient Field for Object Rearrangement. (arXiv:2209.00853v3 [cs.LG] UPDATED)
    Object Rearrangement is to move objects from an initial state to a goal state. Here, we focus on a more practical setting in object rearrangement, i.e., rearranging objects from shuffled layouts to a normative target distribution without explicit goal specification. However, it remains challenging for AI agents, as it is hard to describe the target distribution (goal specification) for reward engineering or collect expert trajectories as demonstrations. Hence, it is infeasible to directly employ reinforcement learning or imitation learning algorithms to address the task. This paper aims to search for a policy only with a set of examples from a target distribution instead of a handcrafted reward function. We employ the score-matching objective to train a Target Gradient Field (TarGF), indicating a direction on each object to increase the likelihood of the target distribution. For object rearrangement, the TarGF can be used in two ways: 1) For model-based planning, we can cast the target gradient into a reference control and output actions with a distributed path planner; 2) For model-free reinforcement learning, the TarGF is not only used for estimating the likelihood-change as a reward but also provides suggested actions in residual policy learning. Experimental results in ball and room rearrangement demonstrate that our method significantly outperforms the state-of-the-art methods in the quality of the terminal state, the efficiency of the control process, and scalability.  ( 2 min )
    Mutation Testing of Deep Reinforcement Learning Based on Real Faults. (arXiv:2301.05651v1 [cs.LG])
    Testing Deep Learning (DL) systems is a complex task as they do not behave like traditional systems would, notably because of their stochastic nature. Nonetheless, being able to adapt existing testing techniques such as Mutation Testing (MT) to DL settings would greatly improve their potential verifiability. While some efforts have been made to extend MT to the Supervised Learning paradigm, little work has gone into extending it to Reinforcement Learning (RL) which is also an important component of the DL ecosystem but behaves very differently from SL. This paper builds on the existing approach of MT in order to propose a framework, RLMutation, for MT applied to RL. Notably, we use existing taxonomies of faults to build a set of mutation operators relevant to RL and use a simple heuristic to generate test cases for RL. This allows us to compare different mutation killing definitions based on existing approaches, as well as to analyze the behavior of the obtained mutation operators and their potential combinations called Higher Order Mutation(s) (HOM). We show that the design choice of the mutation killing definition can affect whether or not a mutation is killed as well as the generated test cases. Moreover, we found that even with a relatively small number of test cases and operators we manage to generate HOM with interesting properties which can enhance testing capability in RL systems.  ( 2 min )
    Universally Expressive Communication in Multi-Agent Reinforcement Learning. (arXiv:2206.06758v3 [cs.MA] UPDATED)
    Allowing agents to share information through communication is crucial for solving complex tasks in multi-agent reinforcement learning. In this work, we consider the question of whether a given communication protocol can express an arbitrary policy. By observing that many existing protocols can be viewed as instances of graph neural networks (GNNs), we demonstrate the equivalence of joint action selection to node labelling. With standard GNN approaches provably limited in their expressive capacity, we draw from existing GNN literature and consider augmenting agent observations with: (1) unique agent IDs and (2) random noise. We provide a theoretical analysis as to how these approaches yield universally expressive communication, and also prove them capable of targeting arbitrary sets of actions for identical agents. Empirically, these augmentations are found to improve performance on tasks where expressive communication is required, whilst, in general, the optimal communication protocol is found to be task-dependent.  ( 2 min )
    MoCapAct: A Multi-Task Dataset for Simulated Humanoid Control. (arXiv:2208.07363v3 [cs.RO] UPDATED)
    Simulated humanoids are an appealing research domain due to their physical capabilities. Nonetheless, they are also challenging to control, as a policy must drive an unstable, discontinuous, and high-dimensional physical system. One widely studied approach is to utilize motion capture (MoCap) data to teach the humanoid agent low-level skills (e.g., standing, walking, and running) that can then be re-used to synthesize high-level behaviors. However, even with MoCap data, controlling simulated humanoids remains very hard, as MoCap data offers only kinematic information. Finding physical control inputs to realize the demonstrated motions requires computationally intensive methods like reinforcement learning. Thus, despite the publicly available MoCap data, its utility has been limited to institutions with large-scale compute. In this work, we dramatically lower the barrier for productive research on this topic by training and releasing high-quality agents that can track over three hours of MoCap data for a simulated humanoid in the dm_control physics-based environment. We release MoCapAct (Motion Capture with Actions), a dataset of these expert agents and their rollouts, which contain proprioceptive observations and actions. We demonstrate the utility of MoCapAct by using it to train a single hierarchical policy capable of tracking the entire MoCap dataset within dm_control and show the learned low-level component can be re-used to efficiently learn downstream high-level tasks. Finally, we use MoCapAct to train an autoregressive GPT model and show that it can control a simulated humanoid to perform natural motion completion given a motion prompt. Videos of the results and links to the code and dataset are available at https://microsoft.github.io/MoCapAct.  ( 2 min )
    Explicit Temporal Embedding in Deep Generative Latent Models for Longitudinal Medical Image Synthesis. (arXiv:2301.05465v1 [cs.CV])
    Medical imaging plays a vital role in modern diagnostics and treatment. The temporal nature of disease or treatment progression often results in longitudinal data. Due to the cost and potential harm, acquiring large medical datasets necessary for deep learning can be difficult. Medical image synthesis could help mitigate this problem. However, until now, the availability of GANs capable of synthesizing longitudinal volumetric data has been limited. To address this, we use the recent advances in latent space-based image editing to propose a novel joint learning scheme to explicitly embed temporal dependencies in the latent space of GANs. This, in contrast to previous methods, allows us to synthesize continuous, smooth, and high-quality longitudinal volumetric data with limited supervision. We show the effectiveness of our approach on three datasets containing different longitudinal dependencies. Namely, modeling a simple image transformation, breathing motion, and tumor regression, all while showing minimal disentanglement. The implementation is made available online at https://github.com/julschoen/Temp-GAN.  ( 2 min )
    Feature Importance Guided Attack: A Model Agnostic Adversarial Attack. (arXiv:2106.14815v3 [cs.LG] UPDATED)
    Research in adversarial learning has primarily focused on homogeneous unstructured datasets, which often map into the problem space naturally. Inverting a feature space attack on heterogeneous datasets into the problem space is much more challenging, particularly the task of finding the perturbation to perform. This work presents a formal search strategy: the `Feature Importance Guided Attack' (FIGA), which finds perturbations in the feature space of heterogeneous tabular datasets to produce evasion attacks. We first demonstrate FIGA in the feature space and then in the problem space. FIGA assumes no prior knowledge of the defending model's learning algorithm and does not require any gradient information. FIGA assumes knowledge of the feature representation and the mean feature values of defending model's dataset. FIGA leverages feature importance rankings by perturbing the most important features of the input in the direction of the target class. While FIGA is conceptually similar to other work which uses feature selection processes (e.g., mimicry attacks), we formalize an attack algorithm with three tunable parameters and investigate the strength of FIGA on tabular datasets. We demonstrate the effectiveness of FIGA by evading phishing detection models trained on four different tabular phishing datasets and one financial dataset with an average success rate of 94%. We extend FIGA to the phishing problem space by limiting the possible perturbations to be valid and feasible in the phishing domain. We generate valid adversarial phishing sites that are visually identical to their unperturbed counterpart and use them to attack six tabular ML models achieving a 13.05% average success rate.  ( 2 min )
    Adam Can Converge Without Any Modification On Update Rules. (arXiv:2208.09632v5 [cs.LG] UPDATED)
    Ever since Reddi et al. 2018 pointed out the divergence issue of Adam, many new variants have been designed to obtain convergence. However, vanilla Adam remains exceptionally popular and it works well in practice. Why is there a gap between theory and practice? We point out there is a mismatch between the settings of theory and practice: Reddi et al. 2018 pick the problem after picking the hyperparameters of Adam, i.e., $(\beta_1, \beta_2)$; while practical applications often fix the problem first and then tune $(\beta_1, \beta_2)$. Due to this observation, we conjecture that the empirical convergence can be theoretically justified, only if we change the order of picking the problem and hyperparameter. In this work, we confirm this conjecture. We prove that, when $\beta_2$ is large and $\beta_1 < \sqrt{\beta_2}<1$, Adam converges to the neighborhood of critical points. The size of the neighborhood is propositional to the variance of stochastic gradients. Under an extra condition (strong growth condition), Adam converges to critical points. It is worth mentioning that our results cover a wide range of hyperparameters: as $\beta_2$ increases, our convergence result can cover any $\beta_1 \in [0,1)$ including $\beta_1=0.9$, which is the default setting in deep learning libraries. To our knowledge, this is the first result showing that Adam can converge without any modification on its update rules. Further, our analysis does not require assumptions of bounded gradients or bounded 2nd-order momentum. When $\beta_2$ is small, we further point out a large region of $(\beta_1,\beta_2)$ where Adam can diverge to infinity. Our divergence result considers the same setting as our convergence result, indicating a phase transition from divergence to convergence when increasing $\beta_2$. These positive and negative results can provide suggestions on how to tune Adam hyperparameters.  ( 3 min )
    Hierarchical Deep Q-Learning Based Handover in Wireless Networks with Dual Connectivity. (arXiv:2301.05391v1 [cs.NI])
    5G New Radio proposes the usage of frequencies above 10 GHz to speed up LTE's existent maximum data rates. However, the effective size of 5G antennas and consequently its repercussions in the signal degradation in urban scenarios makes it a challenge to maintain stable coverage and connectivity. In order to obtain the best from both technologies, recent dual connectivity solutions have proved their capabilities to improve performance when compared with coexistent standalone 5G and 4G technologies. Reinforcement learning (RL) has shown its huge potential in wireless scenarios where parameter learning is required given the dynamic nature of such context. In this paper, we propose two reinforcement learning algorithms: a single agent RL algorithm named Clipped Double Q-Learning (CDQL) and a hierarchical Deep Q-Learning (HiDQL) to improve Multiple Radio Access Technology (multi-RAT) dual-connectivity handover. We compare our proposal with two baselines: a fixed parameter and a dynamic parameter solution. Simulation results reveal significant improvements in terms of latency with a gain of 47.6% and 26.1% for Digital-Analog beamforming (BF), 17.1% and 21.6% for Hybrid-Analog BF, and 24.7% and 39% for Analog-Analog BF when comparing the RL-schemes HiDQL and CDQL with the with the existent solutions, HiDQL presented a slower convergence time, however obtained a more optimal solution than CDQL. Additionally, we foresee the advantages of utilizing context-information as geo-location of the UEs to reduce the beam exploration sector, and thus improving further multi-RAT handover latency results.  ( 2 min )
    Multi-Target Landmark Detection with Incomplete Images via Reinforcement Learning and Shape Prior. (arXiv:2301.05392v1 [cs.CV])
    Medical images are generally acquired with limited field-of-view (FOV), which could lead to incomplete regions of interest (ROI), and thus impose a great challenge on medical image analysis. This is particularly evident for the learning-based multi-target landmark detection, where algorithms could be misleading to learn primarily the variation of background due to the varying FOV, failing the detection of targets. Based on learning a navigation policy, instead of predicting targets directly, reinforcement learning (RL)-based methods have the potential totackle this challenge in an efficient manner. Inspired by this, in this work we propose a multi-agent RL framework for simultaneous multi-target landmark detection. This framework is aimed to learn from incomplete or (and) complete images to form an implicit knowledge of global structure, which is consolidated during the training stage for the detection of targets from either complete or incomplete test images. To further explicitly exploit the global structural information from incomplete images, we propose to embed a shape model into the RL process. With this prior knowledge, the proposed RL model can not only localize dozens of targetssimultaneously, but also work effectively and robustly in the presence of incomplete images. We validated the applicability and efficacy of the proposed method on various multi-target detection tasks with incomplete images from practical clinics, using body dual-energy X-ray absorptiometry (DXA), cardiac MRI and head CT datasets. Results showed that our method could predict whole set of landmarks with incomplete training images up to 80% missing proportion (average distance error 2.29 cm on body DXA), and could detect unseen landmarks in regions with missing image information outside FOV of target images (average distance error 6.84 mm on 3D half-head CT).  ( 2 min )
    A Generic Graph Sparsification Framework using Deep Reinforcement Learning. (arXiv:2112.01565v2 [cs.LG] UPDATED)
    The interconnectedness and interdependence of modern graphs are growing ever more complex, causing enormous resources for processing, storage, communication, and decision-making of these graphs. In this work, we focus on the task graph sparsification: an edge-reduced graph of a similar structure to the original graph is produced while various user-defined graph metrics are largely preserved. Existing graph sparsification methods are mostly sampling-based, which introduce high computation complexity in general and lack of flexibility for a different reduction objective. We present SparRL, the first generic and effective graph sparsification framework enabled by deep reinforcement learning. SparRL can easily adapt to different reduction goals and promise graph-size-independent complexity. Extensive experiments show that SparRL outperforms all prevailing sparsification methods in producing high-quality sparsified graphs concerning a variety of objectives.  ( 2 min )
    Co-manipulation of soft-materials estimating deformation from depth images. (arXiv:2301.05609v1 [cs.RO])
    Human-robot co-manipulation of soft materials, such as fabrics, composites, and sheets of paper/cardboard, is a challenging operation that presents several relevant industrial applications. Estimating the deformation state of the co-manipulated material is one of the main challenges. Viable methods provide the indirect measure by calculating the human-robot relative distance. In this paper, we develop a data-driven model to estimate the deformation state of the material from a depth image through a Convolutional Neural Network (CNN). First, we define the deformation state of the material as the relative roto-translation from the current robot pose and a human grasping position. The model estimates the current deformation state through a Convolutional Neural Network, specifically a DenseNet-121 pretrained on ImageNet.The delta between the current and the desired deformation state is fed to the robot controller that outputs twist commands. The paper describes the developed approach to acquire, preprocess the dataset and train the model. The model is compared with the current state-of-the-art method based on a skeletal tracker from cameras. Results show that our approach achieves better performances and avoids the various drawbacks caused by using a skeletal tracker.Finally, we also studied the model performance according to different architectures and dataset dimensions to minimize the time required for dataset acquisition  ( 2 min )
    Time-Myopic Go-Explore: Learning A State Representation for the Go-Explore Paradigm. (arXiv:2301.05635v1 [cs.LG])
    Very large state spaces with a sparse reward signal are difficult to explore. The lack of a sophisticated guidance results in a poor performance for numerous reinforcement learning algorithms. In these cases, the commonly used random exploration is often not helpful. The literature shows that this kind of environments require enormous efforts to systematically explore large chunks of the state space. Learned state representations can help here to improve the search by providing semantic context and build a structure on top of the raw observations. In this work we introduce a novel time-myopic state representation that clusters temporal close states together while providing a time prediction capability between them. By adapting this model to the Go-Explore paradigm (Ecoffet et al., 2021b), we demonstrate the first learned state representation that reliably estimates novelty instead of using the hand-crafted representation heuristic. Our method shows an improved solution for the detachment problem which still remains an issue at the Go-Explore Exploration Phase. We provide evidence that our proposed method covers the entire state space with respect to all possible time trajectories without causing disadvantageous conflict-overlaps in the cell archive. Analogous to native Go-Explore, our approach is evaluated on the hard exploration environments MontezumaRevenge, Gravitar and Frostbite (Atari) in order to validate its capabilities on difficult tasks. Our experiments show that time-myopic Go-Explore is an effective alternative for the domain-engineered heuristic while also being more general. The source code of the method is available on GitHub.  ( 2 min )
    A Novel Framework for Handling Sparse Data in Traffic Forecast. (arXiv:2301.05292v1 [cs.LG])
    The ever increasing amount of GPS-equipped vehicles provides in real-time valuable traffic information for the roads traversed by the moving vehicles. In this way, a set of sparse and time evolving traffic reports is generated for each road. These time series are a valuable asset in order to forecast the future traffic condition. In this paper we present a deep learning framework that encodes the sparse recent traffic information and forecasts the future traffic condition. Our framework consists of a recurrent part and a decoder. The recurrent part employs an attention mechanism that encodes the traffic reports that are available at a particular time window. The decoder is responsible to forecast the future traffic condition.  ( 2 min )
    Dynamic Data Assimilation of MPAS-O and the Global Drifter Dataset. (arXiv:2301.05551v1 [physics.ao-ph])
    In this study, we propose a new method for combining in situ buoy measurements with Earth system models (ESMs) to improve the accuracy of temperature predictions in the ocean. The technique utilizes the dynamics and modes identified in ESMs to improve the accuracy of buoy measurements while still preserving features such as seasonality. Using this technique, errors in localized temperature predictions made by the MPAS-O model can be corrected. We demonstrate that our approach improves accuracy compared to other interpolation and data assimilation methods. We apply our method to assimilate the Model for Prediction Across Scales Ocean component (MPAS-O) with the Global Drifter Program's in-situ ocean buoy dataset.  ( 2 min )
    LVRNet: Lightweight Image Restoration for Aerial Images under Low Visibility. (arXiv:2301.05434v1 [cs.CV])
    Learning to recover clear images from images having a combination of degrading factors is a challenging task. That being said, autonomous surveillance in low visibility conditions caused by high pollution/smoke, poor air quality index, low light, atmospheric scattering, and haze during a blizzard becomes even more important to prevent accidents. It is thus crucial to form a solution that can result in a high-quality image and is efficient enough to be deployed for everyday use. However, the lack of proper datasets available to tackle this task limits the performance of the previous methods proposed. To this end, we generate the LowVis-AFO dataset, containing 3647 paired dark-hazy and clear images. We also introduce a lightweight deep learning model called Low-Visibility Restoration Network (LVRNet). It outperforms previous image restoration methods with low latency, achieving a PSNR value of 25.744 and an SSIM of 0.905, making our approach scalable and ready for practical use. The code and data can be found at https://github.com/Achleshwar/LVRNet.  ( 2 min )
    A Comprehensive Survey to Dataset Distillation. (arXiv:2301.05603v1 [cs.LG])
    Deep learning technology has unprecedentedly developed in the last decade and has become the primary choice in many application domains. This progress is mainly attributed to a systematic collaboration that rapidly growing computing resources encourage advanced algorithms to deal with massive data. However, it gradually becomes challenging to cope with the unlimited growth of data with limited computing power. To this end, diverse approaches are proposed to improve data processing efficiency. Dataset distillation, one of the dataset reduction methods, tackles the problem via synthesising a small typical dataset from giant data and has attracted a lot of attention from the deep learning community. Existing dataset distillation methods can be taxonomised into meta-learning and data match framework according to whether explicitly mimic target data. Albeit dataset distillation has shown a surprising performance in compressing datasets, it still possesses several limitations such as distilling high-resolution data. This paper provides a holistic understanding of dataset distillation from multiple aspects, including distillation frameworks and algorithms, disentangled dataset distillation, performance comparison, and applications. Finally, we discuss challenges and promising directions to further promote future studies about dataset distillation.  ( 2 min )
    Predictions of photophysical properties of phosphorescent platinum(II) complexes based on ensemble machine learning approach. (arXiv:2301.05639v1 [cs.LG])
    Phosphorescent metal complexes have been under intense investigations as emissive dopants for energy efficient organic light emitting diodes (OLEDs). Among them, cyclometalated Pt(II) complexes are widespread triplet emitters with color-tunable emissions. To render their practical applications as OLED emitters, it is in great need to develop Pt(II) complexes with high radiative decay rate constant ($k_r$) and photoluminescence (PL) quantum yield. Thus, an efficient and accurate prediction tool is highly desirable. Here, we develop a general protocol for accurate predictions of emission wavelength, radiative decay rate constant, and PL quantum yield for phosphorescent Pt(II) emitters based on the combination of first-principles quantum mechanical method, machine learning (ML) and experimental calibration. A new dataset concerning phosphorescent Pt(II) emitters is constructed, with more than two hundred samples collected from the literature. Features containing pertinent electronic properties of the complexes are chosen. Our results demonstrate that ensemble learning models combined with stacking-based approaches exhibit the best performance, where the values of squared correlation coefficients ($R^2$), mean absolute error (MAE), and root mean square error (RMSE) are 0.96, 7.21 nm and 13.00 nm for emission wavelength prediction, and 0.81, 0.11 and 0.15 for PL quantum yield prediction. For radiative decay rate constant ($k_r$), the obtained value of $R^2$ is 0.67 while MAE and RMSE are 0.21 and 0.25 (both in log scale), respectively. The accuracy of the protocol is further confirmed using 24 recently reported Pt(II) complexes, which demonstrates its reliability for a broad palette of Pt(II) emitters.We expect this protocol will become a valuable tool, accelerating the rational design of novel OLED materials with desired properties.  ( 3 min )
    Detection problems in the spiked matrix models. (arXiv:2301.05331v1 [math.ST])
    We study the statistical decision process of detecting the low-rank signal from various signal-plus-noise type data matrices, known as the spiked random matrix models. We first show that the principal component analysis can be improved by entrywise pre-transforming the data matrix if the noise is non-Gaussian, generalizing the known results for the spiked random matrix models with rank-1 signals. As an intermediate step, we find out sharp phase transition thresholds for the extreme eigenvalues of spiked random matrices, which generalize the Baik-Ben Arous-P\'{e}ch\'{e} (BBP) transition. We also prove the central limit theorem for the linear spectral statistics for the spiked random matrices and propose a hypothesis test based on it, which does not depend on the distribution of the signal or the noise. When the noise is non-Gaussian noise, the test can be improved with an entrywise transformation to the data matrix with additive noise. We also introduce an algorithm that estimates the rank of the signal when it is not known a priori.  ( 2 min )
    Inaccessible Neural Language Models Could Reinvigorate Linguistic Nativism. (arXiv:2301.05272v1 [cs.CL])
    Large Language Models (LLMs) have been making big waves in the machine learning community within the past few years. The impressive scalability of LLMs due to the advent of deep learning can be seen as a continuation of empiricist lingusitic methods, as opposed to rule-based linguistic methods that are grounded in a nativist perspective. Current LLMs are generally inaccessible to resource-constrained researchers, due to a variety of factors including closed source code. This work argues that this lack of accessibility could instill a nativist bias in researchers new to computational linguistics, given that new researchers may only have rule-based, nativist approaches to study to produce new work. Also, given that there are numerous critics of deep learning claiming that LLMs and related methods may soon lose their relevancy, we speculate that such an event could trigger a new wave of nativism in the language processing community. To prevent such a dramatic shift and placing favor in hybrid methods of rules and deep learning, we call upon researchers to open source their LLM code wherever possible to allow both empircist and hybrid approaches to remain accessible.  ( 2 min )
    A survey and taxonomy of loss functions in machine learning. (arXiv:2301.05579v1 [cs.LG])
    Most state-of-the-art machine learning techniques revolve around the optimisation of loss functions. Defining appropriate loss functions is therefore critical to successfully solving problems in this field. We present a survey of the most commonly used loss functions for a wide range of different applications, divided into classification, regression, ranking, sample generation and energy based modelling. Overall, we introduce 33 different loss functions and we organise them into an intuitive taxonomy. Each loss function is given a theoretical backing and we describe where it is best used. This survey aims to provide a reference of the most essential loss functions for both beginner and advanced machine learning practitioners.  ( 2 min )
    Building a Fuel Moisture Model for the Coupled Fire-Atmosphere Model WRF-SFIRE from Data: From Kalman Filters to Recurrent Neural Networks. (arXiv:2301.05427v1 [cs.LG])
    The current fuel moisture content (FMC) subsystems in WRF-SFIRE and its workflow system WRFx use a time-lag differential equation model with assimilation of data from FMC sensors on Remote Automated Weather Stations (RAWS) by the extended augmented Kalman filter. But the quality of the result is constrained by the limitations of the model and of the Kalman filter. We observe that the data flow in a system consisting of a model and the Kalman filter can be interpreted to be the same as the data flow in a recurrent neural network (RNN). Thus, instead of building more sophisticated models and data assimilation methods, we want to train a RNN to approximate the dynamics of the response of the FMC sensor to a time series of environmental data. Because standard AI approaches did not converge to reasonable solutions, we pre-train the RNN with special initial weights devised to turn it into a numerical solver of the differential equation. We then allow the AI training machinery to optimize the RNN weights to fit the data better. We illustrate the method on an example of a time series of 10h-FMC from RAWS and weather data from the Real-Time Mesoscale Analysis (RTMA).  ( 2 min )
    Decentralized model-free reinforcement learning in stochastic games with average-reward objective. (arXiv:2301.05630v1 [cs.LG])
    We propose the first model-free algorithm that achieves low regret performance for decentralized learning in two-player zero-sum tabular stochastic games with infinite-horizon average-reward objective. In decentralized learning, the learning agent controls only one player and tries to achieve low regret performances against an arbitrary opponent. This contrasts with centralized learning where the agent tries to approximate the Nash equilibrium by controlling both players. In our infinite-horizon undiscounted setting, additional structure assumptions is needed to provide good behaviors of learning processes : here we assume for every strategy of the opponent, the agent has a way to go from any state to any other. This assumption is the analogous to the "communicating" assumption in the MDP setting. We show that our Decentralized Optimistic Nash Q-Learning (DONQ-learning) algorithm achieves both sublinear high probability regret of order $T^{3/4}$ and sublinear expected regret of order $T^{2/3}$. Moreover, our algorithm enjoys a low computational complexity and low memory space requirement compared to the previous works of (Wei et al. 2017) and (Jafarnia-Jahromi et al. 2021) in the same setting.  ( 2 min )
    Applied Computer Vision on 2-Dimensional Lung X-Ray Images for Assisted Medical Diagnosis of Pneumonia. (arXiv:2207.13295v1 [eess.IV] CROSS LISTED)
    This study focuses on the application of a specific subfield of artificial intelligence referred to as computer vision in the analysis of 2-dimensional lung x-ray images for the assisted medical diagnosis of ordinary pneumonia. A convolutional neural network algorithm was implemented in a Python-coded, Flask-based web application that can analyze x-ray images for the detection of ordinary pneumonia. Since convolutional neural network algorithms rely on machine learning for the identification and detection of patterns, a technique referred to as transfer learning was implemented to train the neural network in the identification and detection of patterns within the dataset. Open-source lung x-ray images were used as training data to create a knowledge base that served as the core element of the web application and the experimental design employed a 5-Trial Confirmatory Test for the validation of the web application. The results of the 5-Trial Confirmatory Test show the calculation of Diagnostic Precision Percentage per Trial, General Diagnostic Precision Percentage, and General Diagnostic Error Percentage while the Confusion Matrix further shows the relationship between the label and the corresponding diagnosis result of the web application on each test images. The developed web application can be used by medical practitioners in A.I.-assisted diagnosis of ordinary pneumonia, and by researchers in the fields of computer science and bioinformatics.  ( 2 min )
    Distributed Online Private Learning of Convex Nondecomposable Objectives. (arXiv:2206.07944v3 [math.OC] UPDATED)
    We deal with a general distributed constrained online learning problem with privacy over time-varying networks, where a class of nondecomposable objectives are considered. Under this setting, each node only controls a part of the global decision, and the goal of all nodes is to collaboratively minimize the global cost over a time horizon $T$ while guarantees the security of the transmitted information. For such problems, we first design a novel generic algorithm framework, named as DPSDA, of differentially private distributed online learning using the Laplace mechanism and the stochastic variants of dual averaging method. Note that in the dual updates, all nodes of DPSDA employ the noise-corrupted gradients for more generality. Then, we propose two algorithms, named as DPSDA-C and DPSDA-PS, under this framework. In DPSDA-C, the nodes implement a circulation-based communication in the primal updates so as to alleviate the disagreements over time-varying undirected networks. In addition, for the extension to time-varying directed ones, the nodes implement the broadcast-based push-sum dynamics in DPSDA-PS, which can achieve average consensus over arbitrary directed networks. Theoretical results show that both algorithms attain an expected regret upper bound in $\mathcal{O}( \sqrt{T} )$ when the objective function is convex, which matches the best utility achievable by cutting-edge algorithms. Finally, numerical experiment results on both synthetic and real-world datasets verify the effectiveness of our algorithms.  ( 2 min )
    Sem@$K$: Is my knowledge graph embedding model semantic-aware?. (arXiv:2301.05601v1 [cs.LG])
    Using knowledge graph embedding models (KGEMs) is a popular approach for predicting links in knowledge graphs (KGs). Traditionally, the performance of KGEMs for link prediction is assessed using rank-based metrics, which evaluate their ability to give high scores to ground-truth entities. However, the literature claims that the KGEM evaluation procedure would benefit from adding supplementary dimensions to assess. That is why, in this paper, we extend our previously introduced metric Sem@$K$ that measures the capability of models to predict valid entities w.r.t. domain and range constrains. In particular, we consider a broad range of KGs and take their respective characteristics into account to propose different versions of Sem@$K$. We also perform an extensive study of KGEM semantic awareness. Our experiments show that Sem@$K$ provides a new perspective on KGEM quality. Its joint analysis with rank-based metrics offer different conclusions on the predictive power of models. Regarding Sem@$K$, some KGEMs are inherently better than others, but this semantic superiority is not indicative of their performance w.r.t. rank-based metrics. In this work, we generalize conclusions about the relative performance of KGEMs w.r.t. rank-based and semantic-oriented metrics at the level of families of models. The joint analysis of the aforementioned metrics gives more insight into the peculiarities of each model. This work paves the way for a more comprehensive evaluation of KGEM adequacy for specific downstream tasks.  ( 2 min )
    Scalable Batch Acquisition for Deep Bayesian Active Learning. (arXiv:2301.05490v1 [cs.LG])
    In deep active learning, it is especially important to choose multiple examples to markup at each step to work efficiently, especially on large datasets. At the same time, existing solutions to this problem in the Bayesian setup, such as BatchBALD, have significant limitations in selecting a large number of examples, associated with the exponential complexity of computing mutual information for joint random variables. We, therefore, present the Large BatchBALD algorithm, which gives a well-grounded approximation to the BatchBALD method that aims to achieve comparable quality while being more computationally efficient. We provide a complexity analysis of the algorithm, showing a reduction in computation time, especially for large batches. Furthermore, we present an extensive set of experimental results on image and text data, both on toy datasets and larger ones such as CIFAR-100.  ( 2 min )
    Multilingual Alzheimer's Dementia Recognition through Spontaneous Speech: a Signal Processing Grand Challenge. (arXiv:2301.05562v1 [eess.AS])
    This Signal Processing Grand Challenge (SPGC) targets a difficult automatic prediction problem of societal and medical relevance, namely, the detection of Alzheimer's Dementia (AD). Participants were invited to employ signal processing and machine learning methods to create predictive models based on spontaneous speech data. The Challenge has been designed to assess the extent to which predictive models built based on speech in one language (English) generalise to another language (Greek). To the best of our knowledge no work has investigated acoustic features of the speech signal in multilingual AD detection. Our baseline system used conventional machine learning algorithms with Active Data Representation of acoustic features, achieving accuracy of 73.91% on AD detection, and 4.95 root mean squared error on cognitive score prediction.  ( 2 min )
    Risk Sensitive Dead-end Identification in Safety-Critical Offline Reinforcement Learning. (arXiv:2301.05664v1 [cs.LG])
    In safety-critical decision-making scenarios being able to identify worst-case outcomes, or dead-ends is crucial in order to develop safe and reliable policies in practice. These situations are typically rife with uncertainty due to unknown or stochastic characteristics of the environment as well as limited offline training data. As a result, the value of a decision at any time point should be based on the distribution of its anticipated effects. We propose a framework to identify worst-case decision points, by explicitly estimating distributions of the expected return of a decision. These estimates enable earlier indication of dead-ends in a manner that is tunable based on the risk tolerance of the designed task. We demonstrate the utility of Distributional Dead-end Discovery (DistDeD) in a toy domain as well as when assessing the risk of severely ill patients in the intensive care unit reaching a point where death is unavoidable. We find that DistDeD significantly improves over prior discovery approaches, providing indications of the risk 10 hours earlier on average as well as increasing detection by 20%.  ( 2 min )
    Amenable Sparse Network Investigator. (arXiv:2202.09284v2 [cs.LG] UPDATED)
    We present "Amenable Sparse Network Investigator" (ASNI) algorithm that utilizes a novel pruning strategy based on a sigmoid function that induces sparsity level globally over the course of one single round of training. The ASNI algorithm fulfills both tasks that current state-of-the-art strategies can only do one of them. The ASNI algorithm has two subalgorithms: 1) ASNI-I, 2) ASNI-II. ASNI-I learns an accurate sparse off-the-shelf network only in one single round of training. ASNI-II learns a sparse network and an initialization that is quantized, compressed, and from which the sparse network is trainable. The learned initialization is quantized since only two numbers are learned for initialization of nonzero parameters in each layer L. Thus, quantization levels for the initialization of the entire network is 2L. Also, the learned initialization is compressed because it is a set consisting of 2L numbers. The special sparse network that can be trained from such a quantized and compressed initialization is called amenable. To the best of our knowledge, there is no other algorithm that can learn a quantized and compressed initialization from which the network is still trainable and is able to solve both pruning tasks. Our numerical experiments show that there is a quantized and compressed initialization from which the learned sparse network can be trained and reach to an accuracy on a par with the dense version. We experimentally show that these 2L levels of quantization are concentration points of parameters in each layer of the learned sparse network by ASNI-I. To corroborate the above, we have performed a series of experiments utilizing networks such as ResNets, VGG-style, small convolutional, and fully connected ones on ImageNet, CIFAR10, and MNIST datasets.  ( 2 min )
    Accelerating nuclear-norm regularized low-rank matrix optimization through Burer-Monteiro decomposition. (arXiv:2204.14067v2 [cs.LG] UPDATED)
    This work proposes a rapid algorithm, BM-Global, for nuclear-norm-regularized convex and low-rank matrix optimization problems. BM-Global efficiently decreases the objective value via low-cost steps leveraging the nonconvex but smooth Burer-Monteiro (BM) decomposition, while effectively escapes saddle points and spurious local minima ubiquitous in the BM form to obtain guarantees of fast convergence rates to the global optima of the original nuclear-norm-regularized problem through aperiodic inexact proximal gradient steps on it. The proposed approach adaptively adjusts the rank for the BM decomposition and can provably identify an optimal rank for the BM decomposition problem automatically in the course of optimization through tools of manifold identification. BM-Global hence also spends significantly less time on parameter tuning than existing matrix-factorization methods, which require an exhaustive search for finding this optimal rank. Extensive experiments on real-world large-scale problems of recommendation systems, regularized kernel estimation, and molecular conformation confirm that BM-Global can indeed effectively escapes spurious local minima at which existing BM approaches are stuck, and is a magnitude faster than state-of-the-art algorithms for low-rank matrix optimization problems involving a nuclear-norm regularizer.  ( 2 min )
    Generalization Properties of NAS under Activation and Skip Connection Search. (arXiv:2209.07238v3 [cs.LG] UPDATED)
    Neural Architecture Search (NAS) has fostered the automatic discovery of state-of-the-art neural architectures. Despite the progress achieved with NAS, so far there is little attention to theoretical guarantees on NAS. In this work, we study the generalization properties of NAS under a unifying framework enabling (deep) layer skip connection search and activation function search. To this end, we derive the lower (and upper) bounds of the minimum eigenvalue of the Neural Tangent Kernel (NTK) under the (in)finite-width regime using a certain search space including mixed activation functions, fully connected, and residual neural networks. We use the minimum eigenvalue to establish generalization error bounds of NAS in the stochastic gradient descent training. Importantly, we theoretically and experimentally show how the derived results can guide NAS to select the top-performing architectures, even in the case without training, leading to a train-free algorithm based on our theory. Accordingly, our numerical validation shed light on the design of computationally efficient methods for NAS. Our analysis is non-trivial due to the coupling of various architectures and activation functions under the unifying framework and has its own interest in providing the lower bound of the minimum eigenvalue of NTK in deep learning theory.  ( 2 min )
    Out-Of-Distribution Detection Is Not All You Need. (arXiv:2211.16158v2 [cs.LG] UPDATED)
    The usage of deep neural networks in safety-critical systems is limited by our ability to guarantee their correct behavior. Runtime monitors are components aiming to identify unsafe predictions and discard them before they can lead to catastrophic consequences. Several recent works on runtime monitoring have focused on out-of-distribution (OOD) detection, i.e., identifying inputs that are different from the training data. In this work, we argue that OOD detection is not a well-suited framework to design efficient runtime monitors and that it is more relevant to evaluate monitors based on their ability to discard incorrect predictions. We call this setting out-ofmodel-scope detection and discuss the conceptual differences with OOD. We also conduct extensive experiments on popular datasets from the literature to show that studying monitors in the OOD setting can be misleading: 1. very good OOD results can give a false impression of safety, 2. comparison under the OOD setting does not allow identifying the best monitor to detect errors. Finally, we also show that removing erroneous training data samples helps to train better monitors.  ( 2 min )
    Knowledge Augmented Machine Learning with Applications in Autonomous Driving: A Survey. (arXiv:2205.04712v2 [cs.LG] UPDATED)
    The existence of representative datasets is a prerequisite of many successful artificial intelligence and machine learning models. However, the subsequent application of these models often involves scenarios that are inadequately represented in the data used for training. The reasons for this are manifold and range from time and cost constraints to ethical considerations. As a consequence, the reliable use of these models, especially in safety-critical applications, is a huge challenge. Leveraging additional, already existing sources of knowledge is key to overcome the limitations of purely data-driven approaches, and eventually to increase the generalization capability of these models. Furthermore, predictions that conform with knowledge are crucial for making trustworthy and safe decisions even in underrepresented scenarios. This work provides an overview of existing techniques and methods in the literature that combine data-based models with existing knowledge. The identified approaches are structured according to the categories integration, extraction and conformity. Special attention is given to applications in the field of autonomous driving.  ( 2 min )
    Hyperparameter Optimization as a Service on INFN Cloud. (arXiv:2301.05522v1 [cs.DC])
    The simplest and often most effective way of parallelizing the training of complex machine learning models is to execute several training instances on multiple machines, possibly scanning the hyperparameter space to optimize the underlying statistical model and the learning procedure. Often, such a meta learning procedure is limited by the ability of accessing securely a common database organizing the knowledge of the previous and ongoing trials. Exploiting opportunistic GPUs provided in different environments represents a further challenge when designing such optimization campaigns. In this contribution we discuss how a set of RestAPIs can be used to access a dedicated service based on INFN Cloud to monitor and possibly coordinate multiple training instances, with gradient-less optimization techniques, via simple HTTP requests. The service, named Hopaas (Hyperparameter OPtimization As A Service), is made of web interface and sets of APIs implemented with a FastAPI back-end running through Uvicorn and NGINX in a virtual instance of INFN Cloud. The optimization algorithms are currently based on Bayesian techniques as provided by Optuna. A Python front-end is also made available for quick prototyping. We present applications to hyperparameter optimization campaigns performed combining private, INFN Cloud and CINECA resources.  ( 2 min )
    A Deep Reinforcement Learning Framework For Column Generation. (arXiv:2206.02568v3 [math.OC] UPDATED)
    Column Generation (CG) is an iterative algorithm for solving linear programs (LPs) with an extremely large number of variables (columns). CG is the workhorse for tackling large-scale \textit{integer} linear programs, which rely on CG to solve LP relaxations within a branch and price algorithm. Two canonical applications are the Cutting Stock Problem (CSP) and Vehicle Routing Problem with Time Windows (VRPTW). In VRPTW, for example, each binary variable represents the decision to include or exclude a \textit{route}, of which there are exponentially many; CG incrementally grows the subset of columns being used, ultimately converging to an optimal solution. We propose RLCG, the first Reinforcement Learning (RL) approach for CG. Unlike typical column selection rules which myopically select a column based on local information at each iteration, we treat CG as a sequential decision-making problem: the column selected in a given iteration affects subsequent column selections. This perspective lends itself to a Deep Reinforcement Learning approach that uses Graph Neural Networks (GNNs) to represent the variable-constraint structure in the LP of interest. We perform an extensive set of experiments using the publicly available BPPLIB benchmark for CSP and Solomon benchmark for VRPTW. RLCG converges faster and reduces the number of CG iterations by 22.4\% for CSP and 40.9\% for VRPTW on average compared to a commonly used greedy policy. Our code is available at https://github.com/chichengmessi/reinforcement-learning-for-column-generation.git.  ( 2 min )
    Robustness in deep learning: The good (width), the bad (depth), and the ugly (initialization). (arXiv:2209.07263v3 [cs.LG] UPDATED)
    We study the average robustness notion in deep neural networks in (selected) wide and narrow, deep and shallow, as well as lazy and non-lazy training settings. We prove that in the under-parameterized setting, width has a negative effect while it improves robustness in the over-parameterized setting. The effect of depth closely depends on the initialization and the training mode. In particular, when initialized with LeCun initialization, depth helps robustness with the lazy training regime. In contrast, when initialized with Neural Tangent Kernel (NTK) and He-initialization, depth hurts the robustness. Moreover, under the non-lazy training regime, we demonstrate how the width of a two-layer ReLU network benefits robustness. Our theoretical developments improve the results by [Huang et al. NeurIPS21; Wu et al. NeurIPS21] and are consistent with [Bubeck and Sellke NeurIPS21; Bubeck et al. COLT21].  ( 2 min )
    TUSK: Task-Agnostic Unsupervised Keypoints. (arXiv:2206.08460v2 [cs.CV] UPDATED)
    Existing unsupervised methods for keypoint learning rely heavily on the assumption that a specific keypoint type (e.g. elbow, digit, abstract geometric shape) appears only once in an image. This greatly limits their applicability, as each instance must be isolated before applying the method-an issue that is never discussed or evaluated. We thus propose a novel method to learn Task-agnostic, UnSupervised Keypoints (TUSK) which can deal with multiple instances. To achieve this, instead of the commonly-used strategy of detecting multiple heatmaps, each dedicated to a specific keypoint type, we use a single heatmap for detection, and enable unsupervised learning of keypoint types through clustering. Specifically, we encode semantics into the keypoints by teaching them to reconstruct images from a sparse set of keypoints and their descriptors, where the descriptors are forced to form distinct clusters in feature space around learned prototypes. This makes our approach amenable to a wider range of tasks than any previous unsupervised keypoint method: we show experiments on multiple-instance detection and classification, object discovery, and landmark detection-all unsupervised-with performance on par with the state of the art, while also being able to deal with multiple instances.  ( 2 min )
    Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space. (arXiv:2206.11895v4 [cs.CV] UPDATED)
    Humans are remarkably flexible in understanding viewpoint changes due to visual cortex supporting the perception of 3D structure. In contrast, most of the computer vision models that learn visual representation from a pool of 2D images often fail to generalize over novel camera viewpoints. Recently, the vision architectures have shifted towards convolution-free architectures, visual Transformers, which operate on tokens derived from image patches. However, these Transformers do not perform explicit operations to learn viewpoint-agnostic representation for visual understanding. To this end, we propose a 3D Token Representation Layer (3DTRL) that estimates the 3D positional information of the visual tokens and leverages it for learning viewpoint-agnostic representations. The key elements of 3DTRL include a pseudo-depth estimator and a learned camera matrix to impose geometric transformations on the tokens, trained in an unsupervised fashion. These enable 3DTRL to recover the 3D positional information of the tokens from 2D patches. In practice, 3DTRL is easily plugged-in into a Transformer. Our experiments demonstrate the effectiveness of 3DTRL in many vision tasks including image classification, multi-view video alignment, and action recognition. The models with 3DTRL outperform their backbone Transformers in all the tasks with minimal added computation. Our code is available at https://github.com/elicassion/3DTRL.  ( 2 min )
    Non-Stochastic CDF Estimation Using Threshold Queries. (arXiv:2301.05682v1 [cs.LG])
    Estimating the empirical distribution of a scalar-valued data set is a basic and fundamental task. In this paper, we tackle the problem of estimating an empirical distribution in a setting with two challenging features. First, the algorithm does not directly observe the data; instead, it only asks a limited number of threshold queries about each sample. Second, the data are not assumed to be independent and identically distributed; instead, we allow for an arbitrary process generating the samples, including an adaptive adversary. These considerations are relevant, for example, when modeling a seller experimenting with posted prices to estimate the distribution of consumers' willingness to pay for a product: offering a price and observing a consumer's purchase decision is equivalent to asking a single threshold query about their value, and the distribution of consumers' values may be non-stationary over time, as early adopters may differ markedly from late adopters. Our main result quantifies, to within a constant factor, the sample complexity of estimating the empirical CDF of a sequence of elements of $[n]$, up to $\varepsilon$ additive error, using one threshold query per sample. The complexity depends only logarithmically on $n$, and our result can be interpreted as extending the existing logarithmic-complexity results for noisy binary search to the more challenging setting where noise is non-stochastic. Along the way to designing our algorithm, we consider a more general model in which the algorithm is allowed to make a limited number of simultaneous threshold queries on each sample. We solve this problem using Blackwell's Approachability Theorem and the exponential weights method. As a side result of independent interest, we characterize the minimum number of simultaneous threshold queries required by deterministic CDF estimation algorithms.  ( 2 min )
    TransfQMix: Transformers for Leveraging the Graph Structure of Multi-Agent Reinforcement Learning Problems. (arXiv:2301.05334v1 [cs.LG])
    Coordination is one of the most difficult aspects of multi-agent reinforcement learning (MARL). One reason is that agents normally choose their actions independently of one another. In order to see coordination strategies emerging from the combination of independent policies, the recent research has focused on the use of a centralized function (CF) that learns each agent's contribution to the team reward. However, the structure in which the environment is presented to the agents and to the CF is typically overlooked. We have observed that the features used to describe the coordination problem can be represented as vertex features of a latent graph structure. Here, we present TransfQMix, a new approach that uses transformers to leverage this latent structure and learn better coordination policies. Our transformer agents perform a graph reasoning over the state of the observable entities. Our transformer Q-mixer learns a monotonic mixing-function from a larger graph that includes the internal and external states of the agents. TransfQMix is designed to be entirely transferable, meaning that same parameters can be used to control and train larger or smaller teams of agents. This enables to deploy promising approaches to save training time and derive general policies in MARL, such as transfer learning, zero-shot transfer, and curriculum learning. We report TransfQMix's performances in the Spread and StarCraft II environments. In both settings, it outperforms state-of-the-art Q-Learning models, and it demonstrates effectiveness in solving problems that other methods can not solve.  ( 2 min )
    Deep Reinforcement Learning for Asset Allocation: Reward Clipping. (arXiv:2301.05300v1 [q-fin.CP])
    Recently, there are many trials to apply reinforcement learning in asset allocation for earning more stable profits. In this paper, we compare performance between several reinforcement learning algorithms - actor-only, actor-critic and PPO models. Furthermore, we analyze each models' character and then introduce the advanced algorithm, so called Reward clipping model. It seems that the Reward Clipping model is better than other existing models in finance domain, especially portfolio optimization - it has strength both in bull and bear markets. Finally, we compare the performance for these models with traditional investment strategies during decreasing and increasing markets.  ( 2 min )
    Port-metriplectic neural networks: thermodynamics-informed machine learning of complex physical systems. (arXiv:2211.01873v2 [cs.LG] UPDATED)
    We develop inductive biases for the machine learning of complex physical systems based on the port-Hamiltonian formalism. To satisfy by construction the principles of thermodynamics in the learned physics (conservation of energy, non-negative entropy production), we modify accordingly the port-Hamiltonian formalism so as to achieve a port-metriplectic one. We show that the constructed networks are able to learn the physics of complex systems by parts, thus alleviating the burden associated to the experimental characterization and posterior learning process of this kind of systems. Predictions can be done, however, at the scale of the complete system. Examples are shown on the performance of the proposed technique.  ( 2 min )
    Deep Learning Symmetries and Their Lie Groups, Algebras, and Subalgebras from First Principles. (arXiv:2301.05638v1 [hep-ph])
    We design a deep-learning algorithm for the discovery and identification of the continuous group of symmetries present in a labeled dataset. We use fully connected neural networks to model the symmetry transformations and the corresponding generators. We construct loss functions that ensure that the applied transformations are symmetries and that the corresponding set of generators forms a closed (sub)algebra. Our procedure is validated with several examples illustrating different types of conserved quantities preserved by symmetry. In the process of deriving the full set of symmetries, we analyze the complete subgroup structure of the rotation groups $SO(2)$, $SO(3)$, and $SO(4)$, and of the Lorentz group $SO(1,3)$. Other examples include squeeze mapping, piecewise discontinuous labels, and $SO(10)$, demonstrating that our method is completely general, with many possible applications in physics and data science. Our study also opens the door for using a machine learning approach in the mathematical study of Lie groups and their properties.  ( 2 min )
    Almost Surely $\sqrt{T}$ Regret Bound for Adaptive LQR. (arXiv:2301.05537v1 [math.OC])
    The Linear-Quadratic Regulation (LQR) problem with unknown system parameters has been widely studied, but it has remained unclear whether $\tilde{ \mathcal{O}}(\sqrt{T})$ regret, which is the best known dependence on time, can be achieved almost surely. In this paper, we propose an adaptive LQR controller with almost surely $\tilde{ \mathcal{O}}(\sqrt{T})$ regret upper bound. The controller features a circuit-breaking mechanism, which circumvents potential safety breach and guarantees the convergence of the system parameter estimate, but is shown to be triggered only finitely often and hence has negligible effect on the asymptotic performance of the controller. The proposed controller is also validated via simulation on Tennessee Eastman Process~(TEP), a commonly used industrial process example.  ( 2 min )
    A Solver-Free Framework for Scalable Learning in Neural ILP Architectures. (arXiv:2210.09082v2 [cs.LG] UPDATED)
    There is a recent focus on designing architectures that have an Integer Linear Programming (ILP) layer within a neural model (referred to as Neural ILP in this paper). Neural ILP architectures are suitable for pure reasoning tasks that require data-driven constraint learning or for tasks requiring both perception (neural) and reasoning (ILP). A recent SOTA approach for end-to-end training of Neural ILP explicitly defines gradients through the ILP black box (Paulus et al. 2021) - this trains extremely slowly, owing to a call to the underlying ILP solver for every training data point in a minibatch. In response, we present an alternative training strategy that is solver-free, i.e., does not call the ILP solver at all at training time. Neural ILP has a set of trainable hyperplanes (for cost and constraints in ILP), together representing a polyhedron. Our key idea is that the training loss should impose that the final polyhedron separates the positives (all constraints satisfied) from the negatives (at least one violated constraint or a suboptimal cost value), via a soft-margin formulation. While positive example(s) are provided as part of the training data, we devise novel techniques for generating negative samples. Our solution is flexible enough to handle equality as well as inequality constraints. Experiments on several problems, both perceptual as well as symbolic, which require learning the constraints of an ILP, show that our approach has superior performance and scales much better compared to purely neural baselines and other state-of-the-art models that require solver-based training. In particular, we are able to obtain excellent performance in 9 x 9 symbolic and visual sudoku, to which the other Neural ILP solver is not able to scale.  ( 2 min )
    On "Deep Learning" Misconduct. (arXiv:2211.16350v4 [cs.LG] UPDATED)
    This is a theoretical paper, as a companion paper of the plenary talk for the same conference ISAIC 2022. In contrast to the author's plenary talk in the same conference, conscious learning (Weng, 2022b; Weng, 2022c) which develops a single network for a life (many tasks), "Deep Learning" trains multiple networks for each task. Although "Deep Learning" may use different learning modes, including supervised, reinforcement and adversarial modes, almost all "Deep Learning" projects apparently suffer from the same misconduct, called "data deletion" and "test on training data". This paper establishes a theorem that a simple method called Pure-Guess Nearest Neighbor (PGNN) reaches any required errors on validation data set and test data set, including zero-error requirements, through the same misconduct, as long as the test data set is in the possession of the authors and both the amount of storage space and the time of training are finite but unbounded. The misconduct violates well-known protocols called transparency and cross-validation. The nature of the misconduct is fatal, because in the absence of any disjoint test, "Deep Learning" is clearly not generalizable.  ( 2 min )
    Language-Informed Transfer Learning for Embodied Household Activities. (arXiv:2301.05318v1 [cs.RO])
    For service robots to become general-purpose in everyday household environments, they need not only a large library of primitive skills, but also the ability to quickly learn novel tasks specified by users. Fine-tuning neural networks on a variety of downstream tasks has been successful in many vision and language domains, but research is still limited on transfer learning between diverse long-horizon tasks. We propose that, compared to reinforcement learning for a new household activity from scratch, home robots can benefit from transferring the value and policy networks trained for similar tasks. We evaluate this idea in the BEHAVIOR simulation benchmark which includes a large number of household activities and a set of action primitives. For easy mapping between state spaces of different tasks, we provide a text-based representation and leverage language models to produce a common embedding space. The results show that the selection of similar source activities can be informed by the semantic similarity of state and goal descriptions with the target task. We further analyze the results and discuss ways to overcome the problem of catastrophic forgetting.  ( 2 min )
    Data-Efficient Structured Pruning via Submodular Optimization. (arXiv:2203.04940v3 [cs.LG] UPDATED)
    Structured pruning is an effective approach for compressing large pre-trained neural networks without significantly affecting their performance. However, most current structured pruning methods do not provide any performance guarantees, and often require fine-tuning, which makes them inapplicable in the limited-data regime. We propose a principled data-efficient structured pruning method based on submodular optimization. In particular, for a given layer, we select neurons/channels to prune and corresponding new weights for the next layer, that minimize the change in the next layer's input induced by pruning. We show that this selection problem is a weakly submodular maximization problem, thus it can be provably approximated using an efficient greedy algorithm. Our method is guaranteed to have an exponentially decreasing error between the original model and the pruned model outputs w.r.t the pruned size, under reasonable assumptions. It is also one of the few methods in the literature that uses only a limited-number of training data and no labels. Our experimental results demonstrate that our method outperforms state-of-the-art methods in the limited-data regime.  ( 2 min )
    Brownian Noise Reduction: Maximizing Privacy Subject to Accuracy Constraints. (arXiv:2206.07234v3 [cs.LG] UPDATED)
    There is a disconnect between how researchers and practitioners handle privacy-utility tradeoffs. Researchers primarily operate from a privacy first perspective, setting strict privacy requirements and minimizing risk subject to these constraints. Practitioners often desire an accuracy first perspective, possibly satisfied with the greatest privacy they can get subject to obtaining sufficiently small error. Ligett et al. have introduced a "noise reduction" algorithm to address the latter perspective. The authors show that by adding correlated Laplace noise and progressively reducing it on demand, it is possible to produce a sequence of increasingly accurate estimates of a private parameter while only paying a privacy cost for the least noisy iterate released. In this work, we generalize noise reduction to the setting of Gaussian noise, introducing the Brownian mechanism. The Brownian mechanism works by first adding Gaussian noise of high variance corresponding to the final point of a simulated Brownian motion. Then, at the practitioner's discretion, noise is gradually decreased by tracing back along the Brownian path to an earlier time. Our mechanism is more naturally applicable to the common setting of bounded $\ell_2$-sensitivity, empirically outperforms existing work on common statistical tasks, and provides customizable control of privacy loss over the entire interaction with the practitioner. We complement our Brownian mechanism with ReducedAboveThreshold, a generalization of the classical AboveThreshold algorithm that provides adaptive privacy guarantees. Overall, our results demonstrate that one can meet utility constraints while still maintaining strong levels of privacy.  ( 2 min )
    Learning with little mixing. (arXiv:2206.08269v2 [cs.LG] UPDATED)
    We study square loss in a realizable time-series framework with martingale difference noise. Our main result is a fast rate excess risk bound which shows that whenever a trajectory hypercontractivity condition holds, the risk of the least-squares estimator on dependent data matches the iid rate order-wise after a burn-in time. In comparison, many existing results in learning from dependent data have rates where the effective sample size is deflated by a factor of the mixing-time of the underlying process, even after the burn-in time. Furthermore, our results allow the covariate process to exhibit long range correlations which are substantially weaker than geometric ergodicity. We call this phenomenon learning with little mixing, and present several examples for when it occurs: bounded function classes for which the $L^2$ and $L^{2+\epsilon}$ norms are equivalent, ergodic finite state Markov chains, various parametric models, and a broad family of infinite dimensional $\ell^2(\mathbb{N})$ ellipsoids. By instantiating our main result to system identification of nonlinear dynamics with generalized linear model transitions, we obtain a nearly minimax optimal excess risk bound after only a polynomial burn-in time.  ( 2 min )
    Biases in Inverse Ising Estimates of Near-Critical Behaviour. (arXiv:2301.05556v1 [cond-mat.dis-nn])
    Inverse Ising inference allows pairwise interactions of complex binary systems to be reconstructed from empirical correlations. Typical estimators used for this inference, such as Pseudo-likelihood maximization (PLM), are biased. Using the Sherrington-Kirkpatrick (SK) model as a benchmark, we show that these biases are large in critical regimes close to phase boundaries, and may alter the qualitative interpretation of the inferred model. In particular, we show that the small-sample bias causes models inferred through PLM to appear closer-to-criticality than one would expect from the data. Data-driven methods to correct this bias are explored and applied to a functional magnetic resonance imaging (fMRI) dataset from neuroscience. Our results indicate that additional care should be taken when attributing criticality to real-world datasets.  ( 2 min )
    An efficient hybrid classification approach for COVID-19 based on Harris Hawks Optimization and Salp Swarm Optimization. (arXiv:2301.05296v1 [cs.NE])
    Feature selection can be defined as one of the pre-processing steps that decrease the dimensionality of a dataset by identifying the most significant attributes while also boosting the accuracy of classification. For solving feature selection problems, this study presents a hybrid binary version of the Harris Hawks Optimization algorithm (HHO) and Salp Swarm Optimization (SSA) (HHOSSA) for Covid-19 classification. The proposed (HHOSSA) presents a strategy for improving the basic HHO's performance using the Salp algorithm's power to select the best fitness values. The HHOSSA was tested against two well-known optimization algorithms, the Whale Optimization Algorithm (WOA) and the Grey wolf optimizer (GWO), utilizing a total of 800 chest X-ray images. A total of four performance metrics (Accuracy, Recall, Precision, F1) were employed in the studies using three classifiers (Support vector machines (SVMs), k-Nearest Neighbor (KNN), and Extreme Gradient Boosting (XGBoost)). The proposed algorithm (HHOSSA) achieved 96% accuracy with the SVM classifier, and 98% accuracy with two classifiers, XGboost and KNN.  ( 2 min )
    Neural network with optimal neuron activation functions based on additive Gaussian process regression. (arXiv:2301.05567v1 [stat.ML])
    Feed-forward neural networks (NN) are a staple machine learning method widely used in many areas of science and technology. While even a single-hidden layer NN is a universal approximator, its expressive power is limited by the use of simple neuron activation functions (such as sigmoid functions) that are typically the same for all neurons. More flexible neuron activation functions would allow using fewer neurons and layers and thereby save computational cost and improve expressive power. We show that additive Gaussian process regression (GPR) can be used to construct optimal neuron activation functions that are individual to each neuron. An approach is also introduced that avoids non-linear fitting of neural network parameters. The resulting method combines the advantage of robustness of a linear regression with the higher expressive power of a NN. We demonstrate the approach by fitting the potential energy surface of the water molecule. Without requiring any non-linear optimization, the additive GPR based approach outperforms a conventional NN in the high accuracy regime, where a conventional NN suffers more from overfitting.  ( 2 min )
    On the explainability of quantum neural networks based on variational quantum circuits. (arXiv:2301.05549v1 [quant-ph])
    Ridge functions are used to describe and study the lower bound of the approximation done by the neural networks which can be written as a linear combination of activation functions. If the activation functions are also ridge functions, these networks are called explainable neural networks. In this paper, we first show that quantum neural networks which are based on variational quantum circuits can be written as a linear combination of ridge functions. Consequently, we show that the interpretability and explainability of such quantum neural networks can be directly considered and studied as an approximation with the linear combination of ridge functions.  ( 2 min )
    Knowledge Enhancement for Multi-Behavior Contrastive Recommendation. (arXiv:2301.05403v1 [cs.IR])
    A well-designed recommender system can accurately capture the attributes of users and items, reflecting the unique preferences of individuals. Traditional recommendation techniques usually focus on modeling the singular type of behaviors between users and items. However, in many practical recommendation scenarios (e.g., social media, e-commerce), there exist multi-typed interactive behaviors in user-item relationships, such as click, tag-as-favorite, and purchase in online shopping platforms. Thus, how to make full use of multi-behavior information for recommendation is of great importance to the existing system, which presents challenges in two aspects that need to be explored: (1) Utilizing users' personalized preferences to capture multi-behavioral dependencies; (2) Dealing with the insufficient recommendation caused by sparse supervision signal for target behavior. In this work, we propose a Knowledge Enhancement Multi-Behavior Contrastive Learning Recommendation (KMCLR) framework, including two Contrastive Learning tasks and three functional modules to tackle the above challenges, respectively. In particular, we design the multi-behavior learning module to extract users' personalized behavior information for user-embedding enhancement, and utilize knowledge graph in the knowledge enhancement module to derive more robust knowledge-aware representations for items. In addition, in the optimization stage, we model the coarse-grained commonalities and the fine-grained differences between multi-behavior of users to further improve the recommendation effect. Extensive experiments and ablation tests on the three real-world datasets indicate our KMCLR outperforms various state-of-the-art recommendation methods and verify the effectiveness of our method.  ( 2 min )
    AAAI 2022 Fall Symposium: Lessons Learned for Autonomous Assessment of Machine Abilities (LLAAMA). (arXiv:2301.05384v1 [cs.LG])
    Modern civilian and military systems have created a demand for sophisticated intelligent autonomous machines capable of operating in uncertain dynamic environments. Such systems are realizable thanks in large part to major advances in perception and decision-making techniques, which in turn have been propelled forward by modern machine learning tools. However, these newer forms of intelligent autonomy raise questions about when/how communication of the operational intent and assessments of actual vs. supposed capabilities of autonomous agents impact overall performance. This symposium examines the possibilities for enabling intelligent autonomous systems to self-assess and communicate their ability to effectively execute assigned tasks, as well as reason about the overall limits of their competencies and maintain operability within those limits. The symposium brings together researchers working in this burgeoning area of research to share lessons learned, identify major theoretical and practical challenges encountered so far, and potential avenues for future research and real-world applications.  ( 2 min )
    HTTE: A Hybrid Technique For Travel Time Estimation In Sparse Data Environments. (arXiv:2301.05293v1 [cs.LG])
    Travel time estimation is a critical task, useful to many urban applications at the individual citizen and the stakeholder level. This paper presents a novel hybrid algorithm for travel time estimation that leverages historical and sparse real-time trajectory data. Given a path and a departure time we estimate the travel time taking into account the historical information, the real-time trajectory data and the correlations among different road segments. We detect similar road segments using historical trajectories, and use a latent representation to model the similarities. Our experimental evaluation demonstrates the effectiveness of our approach.  ( 2 min )
    In BLOOM: Creativity and Affinity in Artificial Lyrics and Art. (arXiv:2301.05402v1 [cs.CL])
    We apply a large multilingual language model (BLOOM-176B) in open-ended generation of Chinese song lyrics, and evaluate the resulting lyrics for coherence and creativity using human reviewers. We find that current computational metrics for evaluating large language model outputs (MAUVE) have limitations in evaluation of creative writing. We note that the human concept of creativity requires lyrics to be both comprehensible and distinctive -- and that humans assess certain types of machine-generated lyrics to score more highly than real lyrics by popular artists. Inspired by the inherently multimodal nature of album releases, we leverage a Chinese-language stable diffusion model to produce high-quality lyric-guided album art, demonstrating a creative approach for an artist seeking inspiration for an album or single. Finally, we introduce the MojimLyrics dataset, a Chinese-language dataset of popular song lyrics for future research.  ( 2 min )
    A Constrained-Optimization Approach to the Execution of Prioritized Stacks of Learned Multi-Robot Tasks. (arXiv:2301.05346v1 [cs.RO])
    This paper presents a constrained-optimization formulation for the prioritized execution of learned robot tasks. The framework lends itself to the execution of tasks encoded by value functions, such as tasks learned using the reinforcement learning paradigm. The tasks are encoded as constraints of a convex optimization program by using control Lyapunov functions. Moreover, an additional constraint is enforced in order to specify relative priorities between the tasks. The proposed approach is showcased in simulation using a team of mobile robots executing coordinated multi-robot tasks.  ( 2 min )
    A Scalable Technique for Weak-Supervised Learning with Domain Constraints. (arXiv:2301.05253v1 [cs.LG])
    We propose a novel scalable end-to-end pipeline that uses symbolic domain knowledge as constraints for learning a neural network for classifying unlabeled data in a weak-supervised manner. Our approach is particularly well-suited for settings where the data consists of distinct groups (classes) that lends itself to clustering-friendly representation learning and the domain constraints can be reformulated for use of efficient mathematical optimization techniques by considering multiple training examples at once. We evaluate our approach on a variant of the MNIST image classification problem where a training example consists of image sequences and the sum of the numbers represented by the sequences, and show that our approach scales significantly better than previous approaches that rely on computing all constraint satisfying combinations for each training example.  ( 2 min )
    Equivariant Representations for Non-Free Group Actions. (arXiv:2301.05231v1 [cs.LG])
    We introduce a method for learning representations that are equivariant with respect to general group actions over data. Differently from existing equivariant representation learners, our method is suitable for actions that are not free i.e., that stabilize data via nontrivial symmetries. Our method is grounded in the orbit-stabilizer theorem from group theory, which guarantees that an ideal learner infers an isomorphic representation. Finally, we provide an empirical investigation on image datasets with rotational symmetries and show that taking stabilizers into account improves the quality of the representations.  ( 2 min )
  • Open

    Risk Sensitive Dead-end Identification in Safety-Critical Offline Reinforcement Learning. (arXiv:2301.05664v1 [cs.LG])
    In safety-critical decision-making scenarios being able to identify worst-case outcomes, or dead-ends is crucial in order to develop safe and reliable policies in practice. These situations are typically rife with uncertainty due to unknown or stochastic characteristics of the environment as well as limited offline training data. As a result, the value of a decision at any time point should be based on the distribution of its anticipated effects. We propose a framework to identify worst-case decision points, by explicitly estimating distributions of the expected return of a decision. These estimates enable earlier indication of dead-ends in a manner that is tunable based on the risk tolerance of the designed task. We demonstrate the utility of Distributional Dead-end Discovery (DistDeD) in a toy domain as well as when assessing the risk of severely ill patients in the intensive care unit reaching a point where death is unavoidable. We find that DistDeD significantly improves over prior discovery approaches, providing indications of the risk 10 hours earlier on average as well as increasing detection by 20%.  ( 2 min )
    Neural network with optimal neuron activation functions based on additive Gaussian process regression. (arXiv:2301.05567v1 [stat.ML])
    Feed-forward neural networks (NN) are a staple machine learning method widely used in many areas of science and technology. While even a single-hidden layer NN is a universal approximator, its expressive power is limited by the use of simple neuron activation functions (such as sigmoid functions) that are typically the same for all neurons. More flexible neuron activation functions would allow using fewer neurons and layers and thereby save computational cost and improve expressive power. We show that additive Gaussian process regression (GPR) can be used to construct optimal neuron activation functions that are individual to each neuron. An approach is also introduced that avoids non-linear fitting of neural network parameters. The resulting method combines the advantage of robustness of a linear regression with the higher expressive power of a NN. We demonstrate the approach by fitting the potential energy surface of the water molecule. Without requiring any non-linear optimization, the additive GPR based approach outperforms a conventional NN in the high accuracy regime, where a conventional NN suffers more from overfitting.  ( 2 min )
    Learning with little mixing. (arXiv:2206.08269v2 [cs.LG] UPDATED)
    We study square loss in a realizable time-series framework with martingale difference noise. Our main result is a fast rate excess risk bound which shows that whenever a trajectory hypercontractivity condition holds, the risk of the least-squares estimator on dependent data matches the iid rate order-wise after a burn-in time. In comparison, many existing results in learning from dependent data have rates where the effective sample size is deflated by a factor of the mixing-time of the underlying process, even after the burn-in time. Furthermore, our results allow the covariate process to exhibit long range correlations which are substantially weaker than geometric ergodicity. We call this phenomenon learning with little mixing, and present several examples for when it occurs: bounded function classes for which the $L^2$ and $L^{2+\epsilon}$ norms are equivalent, ergodic finite state Markov chains, various parametric models, and a broad family of infinite dimensional $\ell^2(\mathbb{N})$ ellipsoids. By instantiating our main result to system identification of nonlinear dynamics with generalized linear model transitions, we obtain a nearly minimax optimal excess risk bound after only a polynomial burn-in time.  ( 2 min )
    Scalable Estimation for Structured Additive Distributional Regression. (arXiv:2301.05593v1 [stat.CO])
    Recently, fitting probabilistic models have gained importance in many areas but estimation of such distributional models with very large data sets is a difficult task. In particular, the use of rather complex models can easily lead to memory-related efficiency problems that can make estimation infeasible even on high-performance computers. We therefore propose a novel backfitting algorithm, which is based on the ideas of stochastic gradient descent and can deal virtually with any amount of data on a conventional laptop. The algorithm performs automatic selection of variables and smoothing parameters, and its performance is in most cases superior or at least equivalent to other implementations for structured additive distributional regression, e.g., gradient boosting, while maintaining low computation time. Performance is evaluated using an extensive simulation study and an exceptionally challenging and unique example of lightning count prediction over Austria. A very large dataset with over 9 million observations and 80 covariates is used, so that a prediction model cannot be estimated with standard distributional regression methods but with our new approach.  ( 2 min )
    Scalable Batch Acquisition for Deep Bayesian Active Learning. (arXiv:2301.05490v1 [cs.LG])
    In deep active learning, it is especially important to choose multiple examples to markup at each step to work efficiently, especially on large datasets. At the same time, existing solutions to this problem in the Bayesian setup, such as BatchBALD, have significant limitations in selecting a large number of examples, associated with the exponential complexity of computing mutual information for joint random variables. We, therefore, present the Large BatchBALD algorithm, which gives a well-grounded approximation to the BatchBALD method that aims to achieve comparable quality while being more computationally efficient. We provide a complexity analysis of the algorithm, showing a reduction in computation time, especially for large batches. Furthermore, we present an extensive set of experimental results on image and text data, both on toy datasets and larger ones such as CIFAR-100.  ( 2 min )
    On the infinite-depth limit of finite-width neural networks. (arXiv:2210.00688v3 [stat.ML] UPDATED)
    In this paper, we study the infinite-depth limit of finite-width residual neural networks with random Gaussian weights. With proper scaling, we show that by fixing the width and taking the depth to infinity, the pre-activations converge in distribution to a zero-drift diffusion process. Unlike the infinite-width limit where the pre-activation converge weakly to a Gaussian random variable, we show that the infinite-depth limit yields different distributions depending on the choice of the activation function. We document two cases where these distributions have closed-form (different) expressions. We further show an intriguing change of regime phenomenon of the post-activation norms when the width increases from 3 to 4. Lastly, we study the sequential limit infinite-depth-then-infinite-width and compare it with the more commonly studied infinite-width-then-infinite-depth limit.  ( 2 min )
    Fully Adaptive Composition in Differential Privacy. (arXiv:2203.05481v2 [cs.LG] UPDATED)
    Composition is a key feature of differential privacy. Well-known advanced composition theorems allow one to query a private database quadratically more times than basic privacy composition would permit. However, these results require that the privacy parameters of all algorithms be fixed before interacting with the data. To address this, Rogers et al. introduced fully adaptive composition, wherein both algorithms and their privacy parameters can be selected adaptively. The authors introduce two probabilistic objects to measure privacy in adaptive composition: privacy filters, which provide differential privacy guarantees for composed interactions, and privacy odometers, time-uniform bounds on privacy loss. There are substantial gaps between advanced composition and existing filters and odometers. First, existing filters place stronger assumptions on the algorithms being composed. Second, these odometers and filters suffer from large constants, making them impractical. We construct filters that match the tightness of advanced composition, including constants, despite allowing for adaptively chosen privacy parameters. En route we also derive a privacy filter for approximate zCDP and approximate RDP. We also construct several general families of odometers. These odometers can match the tightness of advanced composition at an arbitrary, preselected point in time, or at all points in time simultaneously, up to a doubly-logarithmic factor. We obtain our results by leveraging recent advances in time-uniform martingale concentration. In sum, we show that fully adaptive privacy is obtainable at almost no loss, and conjecture that our results are essentially unimprovable (even in constants) in general.  ( 2 min )
    Stable Probability Weighting: Large-Sample and Finite-Sample Estimation and Inference Methods for Heterogeneous Causal Effects of Multivalued Treatments Under Limited Overlap. (arXiv:2301.05703v1 [econ.EM])
    In this paper, I try to tame "Basu's elephants" (data with extreme selection on observables). I propose new practical large-sample and finite-sample methods for estimating and inferring heterogeneous causal effects (under unconfoundedness) in the empirically relevant context of limited overlap. I develop a general principle called "Stable Probability Weighting" (SPW) that can be used as an alternative to the widely used Inverse Probability Weighting (IPW) technique, which relies on strong overlap. I show that IPW (or its augmented version), when valid, is a special case of the more general SPW (or its doubly robust version), which adjusts for the extremeness of the conditional probabilities of the treatment states. The SPW principle can be implemented using several existing large-sample parametric, semiparametric, and nonparametric procedures for conditional moment models. In addition, I provide new finite-sample results that apply when unconfoundedness is plausible within fine strata. Since IPW estimation relies on the problematic reciprocal of the estimated propensity score, I develop a "Finite-Sample Stable Probability Weighting" (FPW) set-estimator that is unbiased in a sense. I also propose new finite-sample inference methods for testing a general class of weak null hypotheses. The associated computationally convenient methods, which can be used to construct valid confidence sets and to bound the finite-sample confidence distribution, are of independent interest. My large-sample and finite-sample frameworks extend to the setting of multivalued treatments.  ( 2 min )
    Memory Efficient Continual Learning with Transformers. (arXiv:2203.04640v2 [cs.CL] UPDATED)
    In many real-world scenarios, data to train machine learning models becomes available over time. Unfortunately, these models struggle to continually learn new concepts without forgetting what has been learnt in the past. This phenomenon is known as catastrophic forgetting and it is difficult to prevent due to practical constraints. For instance, the amount of data that can be stored or the computational resources that can be used might be limited. Moreover, applications increasingly rely on large pre-trained neural networks, such as pre-trained Transformers, since the resources or data might not be available in sufficiently large quantities to practitioners to train the model from scratch. In this paper, we devise a method to incrementally train a model on a sequence of tasks using pre-trained Transformers and extending them with Adapters. Different than the existing approaches, our method is able to scale to a large number of tasks without significant overhead and allows sharing information across tasks. On both image and text classification tasks, we empirically demonstrate that our method maintains a good predictive performance without retraining the model or increasing the number of model parameters over time. The resulting model is also significantly faster at inference time compared to Adapter-based state-of-the-art methods.  ( 2 min )
    Brownian Noise Reduction: Maximizing Privacy Subject to Accuracy Constraints. (arXiv:2206.07234v3 [cs.LG] UPDATED)
    There is a disconnect between how researchers and practitioners handle privacy-utility tradeoffs. Researchers primarily operate from a privacy first perspective, setting strict privacy requirements and minimizing risk subject to these constraints. Practitioners often desire an accuracy first perspective, possibly satisfied with the greatest privacy they can get subject to obtaining sufficiently small error. Ligett et al. have introduced a "noise reduction" algorithm to address the latter perspective. The authors show that by adding correlated Laplace noise and progressively reducing it on demand, it is possible to produce a sequence of increasingly accurate estimates of a private parameter while only paying a privacy cost for the least noisy iterate released. In this work, we generalize noise reduction to the setting of Gaussian noise, introducing the Brownian mechanism. The Brownian mechanism works by first adding Gaussian noise of high variance corresponding to the final point of a simulated Brownian motion. Then, at the practitioner's discretion, noise is gradually decreased by tracing back along the Brownian path to an earlier time. Our mechanism is more naturally applicable to the common setting of bounded $\ell_2$-sensitivity, empirically outperforms existing work on common statistical tasks, and provides customizable control of privacy loss over the entire interaction with the practitioner. We complement our Brownian mechanism with ReducedAboveThreshold, a generalization of the classical AboveThreshold algorithm that provides adaptive privacy guarantees. Overall, our results demonstrate that one can meet utility constraints while still maintaining strong levels of privacy.  ( 2 min )
    A fully Bayesian sparse polynomial chaos expansion approach with joint priors on the coefficients and global selection of terms. (arXiv:2204.06043v2 [stat.CO] UPDATED)
    Polynomial chaos expansion (PCE) is a versatile tool widely used in uncertainty quantification and machine learning, but its successful application depends strongly on the accuracy and reliability of the resulting PCE-based response surface. High accuracy typically requires high polynomial degrees, demanding many training points especially in high-dimensional problems through the curse of dimensionality. So-called sparse PCE concepts work with a much smaller selection of basis polynomials compared to conventional PCE approaches and can overcome the curse of dimensionality very efficiently, but have to pay specific attention to their strategies of choosing training points. Furthermore, the approximation error resembles an uncertainty that most existing PCE-based methods do not estimate. In this study, we develop and evaluate a fully Bayesian approach to establish the PCE representation via joint shrinkage priors and Markov chain Monte Carlo. The suggested Bayesian PCE model directly aims to solve the two challenges named above: achieving a sparse PCE representation and estimating uncertainty of the PCE itself. The embedded Bayesian regularizing via the joint shrinkage prior allows using higher polynomial degrees for given training points due to its ability to handle underdetermined situations, where the number of considered PCE coefficients could be much larger than the number of available training points. We also explore multiple variable selection methods to construct sparse PCE expansions based on the established Bayesian representations, while globally selecting the most meaningful orthonormal polynomials given the available training data. We demonstrate the advantages of our Bayesian PCE and the corresponding sparsity-inducing methods on several benchmarks.  ( 2 min )
    Detection problems in the spiked matrix models. (arXiv:2301.05331v1 [math.ST])
    We study the statistical decision process of detecting the low-rank signal from various signal-plus-noise type data matrices, known as the spiked random matrix models. We first show that the principal component analysis can be improved by entrywise pre-transforming the data matrix if the noise is non-Gaussian, generalizing the known results for the spiked random matrix models with rank-1 signals. As an intermediate step, we find out sharp phase transition thresholds for the extreme eigenvalues of spiked random matrices, which generalize the Baik-Ben Arous-P\'{e}ch\'{e} (BBP) transition. We also prove the central limit theorem for the linear spectral statistics for the spiked random matrices and propose a hypothesis test based on it, which does not depend on the distribution of the signal or the noise. When the noise is non-Gaussian noise, the test can be improved with an entrywise transformation to the data matrix with additive noise. We also introduce an algorithm that estimates the rank of the signal when it is not known a priori.  ( 2 min )
    Efficient and robust transfer learning of optimal individualized treatment regimes with right-censored survival data. (arXiv:2301.05491v1 [stat.ME])
    An individualized treatment regime (ITR) is a decision rule that assigns treatments based on patients' characteristics. The value function of an ITR is the expected outcome in a counterfactual world had this ITR been implemented. Recently, there has been increasing interest in combining heterogeneous data sources, such as leveraging the complementary features of randomized controlled trial (RCT) data and a large observational study (OS). Usually, a covariate shift exists between the source and target population, rendering the source-optimal ITR unnecessarily optimal for the target population. We present an efficient and robust transfer learning framework for estimating the optimal ITR with right-censored survival data that generalizes well to the target population. The value function accommodates a broad class of functionals of survival distributions, including survival probabilities and restrictive mean survival times (RMSTs). We propose a doubly robust estimator of the value function, and the optimal ITR is learned by maximizing the value function within a pre-specified class of ITRs. We establish the $N^{-1/3}$ rate of convergence for the estimated parameter indexing the optimal ITR, and show that the proposed optimal value estimator is consistent and asymptotically normal even with flexible machine learning methods for nuisance parameter estimation. We evaluate the empirical performance of the proposed method by simulation studies and a real data application of sodium bicarbonate therapy for patients with severe metabolic acidaemia in the intensive care unit (ICU), combining a RCT and an observational study with heterogeneity.  ( 2 min )
    Port-metriplectic neural networks: thermodynamics-informed machine learning of complex physical systems. (arXiv:2211.01873v2 [cs.LG] UPDATED)
    We develop inductive biases for the machine learning of complex physical systems based on the port-Hamiltonian formalism. To satisfy by construction the principles of thermodynamics in the learned physics (conservation of energy, non-negative entropy production), we modify accordingly the port-Hamiltonian formalism so as to achieve a port-metriplectic one. We show that the constructed networks are able to learn the physics of complex systems by parts, thus alleviating the burden associated to the experimental characterization and posterior learning process of this kind of systems. Predictions can be done, however, at the scale of the complete system. Examples are shown on the performance of the proposed technique.  ( 2 min )
    Global Riemannian Acceleration in Hyperbolic and Spherical Spaces. (arXiv:2012.03618v5 [math.OC] UPDATED)
    We further research on the accelerated optimization phenomenon on Riemannian manifolds by introducing accelerated global first-order methods for the optimization of $L$-smooth and geodesically convex (g-convex) or $\mu$-strongly g-convex functions defined on the hyperbolic space or a subset of the sphere. For a manifold other than the Euclidean space, these are the first methods to \emph{globally} achieve the same rates as accelerated gradient descent in the Euclidean space with respect to $L$ and $\epsilon$ (and $\mu$ if it applies), up to log factors. Due to the geometric deformations, our rates have an extra factor, depending on the initial distance $R$ to a minimizer and the curvature $K$, with respect to Euclidean accelerated algorithms As a proxy for our solution, we solve a constrained non-convex Euclidean problem, under a condition between convexity and \emph{quasar-convexity}, of independent interest. Additionally, for any Riemannian manifold of bounded sectional curvature, we provide reductions from optimization methods for smooth and g-convex functions to methods for smooth and strongly g-convex functions and vice versa. We also reduce global optimization to optimization over bounded balls where the effect of the curvature is reduced.  ( 2 min )

  • Open

    Looking for a CV/ML freelancer
    We are currently working looking for someone to create an app that works for images and video where the user would highlight the outline of the person in the image or video and the app would return the image and video of the person with a transparent background. The user could then go back and keep highlighting to refine the image or video of the person. If the image or video of said person is good they would just save the it on the app itself. We would want this app to be made with swift for iOS and preferably on edge. At the end just send over the project folder. Dm if you are interested submitted by /u/bluebamboo3 [link] [comments]  ( 45 min )
    This AI can clone your voice! VALL-E (explained)
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 44 min )
    A Video Script I Made Using ChatGPT & Pictory
    submitted by /u/WolfAmoux [link] [comments]  ( 44 min )
    6. Is it possible for crime to increase due to AI in the workforce?
    submitted by /u/Big-Heron-7955 [link] [comments]  ( 44 min )
    Did this mostly for fun once I found out the CW's Shared superhero multiverse was coming to a close. Figured I might as well post it here for anyone whose interested.
    submitted by /u/Eleganos [link] [comments]  ( 44 min )
    Make a Drake freestyle with AI
    Drake - Zesty (New Unreleased Freestyle) ​ Recently just uploaded his new freestyle on youtube. Pretty lit and original. Check it out: https://youtu.be/060cUnsoYo8 submitted by /u/Jay-Query [link] [comments]  ( 44 min )
    The media firms & publisher are beginning their fight back
    submitted by /u/MrEloi [link] [comments]  ( 45 min )
    Artists sue Stability AI, Midjourney and DeviantArt
    submitted by /u/Peaking_AI [link] [comments]  ( 44 min )
    Weekly China AI News: Alibaba Predicts Generative AI as Top Tech Trend for 2023, China's AI Computing Surpasses General Computing, and AI Can Recognize Lip Syncing
    submitted by /u/trcytony [link] [comments]  ( 44 min )
    Artificial intelligence
    submitted by /u/ZahraMuxammed [link] [comments]  ( 43 min )
    5 AI Content Detector Tools You Should Know About! (ChatGPT Not Included)
    submitted by /u/Chisom1998_ [link] [comments]  ( 50 min )
    Interview with the guy who developed wifi human movement tracking
    in this newsletter like halfway down how long before this just turns into the Dark Knight surveillance scenario? apparently you can use wifi antennas to track people submitted by /u/jrstelle [link] [comments]  ( 45 min )
    I made my first website with AI. It is Ingredient Genie, and it creates recipes based on your ingredients.
    submitted by /u/MightyMercenary0 [link] [comments]  ( 45 min )
    🚀Muse:Text 2 Image Generation via Masked Generative Transformers
    submitted by /u/oridnary_artist [link] [comments]  ( 45 min )
    Create an image generation model that takes concept art and turns it into a character sheet
    Hey all! I have been experimenting with a few AI apps (I was lucky enough to get into the leonardo beta). I am mainly interested in creating character sheets for 3D modelling like this: ​ https://preview.redd.it/7vv9zotsbfca1.jpg?width=564&format=pjpg&auto=webp&s=37df5fc0affb728338a3517d5710453b72b0830b Initially, I thought that training a model with a bunch of character sheets from google images would suffice. I figured if I used an image of a character (think general concept art, without the T-pose and different views) as the "input" image, it would spit out something resembling a character sheet, only with the character I want in it. This didn't work, it ended up re-imagining the character with the art style of the character sheet (hand drawn lines, cartoony): ​ https://preview.redd.it/rp6hh8qgdfca1.png?width=1486&format=png&auto=webp&s=81fb7607cbd58a938f25314c471772aaa5d8916c So I guess my question is: Is there an AI service/app/tool that might accomplish this? If not, what other methods should I look into? submitted by /u/matthew798 [link] [comments]  ( 48 min )
    OpenAI’s ChatGPT: The 10 Worst Things to Expect
    submitted by /u/liquidocelotYT [link] [comments]  ( 43 min )
    Interactive Evolutionary Computation and ChatGPT
    submitted by /u/BenjaminJamesBush [link] [comments]  ( 45 min )
    AI Host that independently runs Live shows (on Fb, Youtube, Twitch)
    Hi there! Making my first contribution to this subreddit and sharing something you probably haven't heard of (I might be wrong though :) So, this is about AI host for Live quizzes. The AI generated avatar can run a live quiz and doesn't need any pre-made scripts or your help. Works only on Facebook Live, Youtube and Twitch. What do you think? ​ https://preview.redd.it/94gb6wmc6fca1.png?width=1080&format=png&auto=webp&s=a14bfd0b1140e887d1b7c615fd0ef00e88373ce6 submitted by /u/AnnetWw [link] [comments]  ( 50 min )
    What is reinforcement learning from human feedback (RLHF)?
    submitted by /u/bendee983 [link] [comments]  ( 44 min )
    AI in Education: The Good, the Bad, and the Downright Confusing
    submitted by /u/pauerrrr [link] [comments]  ( 44 min )
    Android AI Assistant - Use GPT from anywhere!
    submitted by /u/better__ideas [link] [comments]  ( 44 min )
    I got ChatGPT to create a new joke. I would never have thought this possible.
    submitted by /u/Ivorius [link] [comments]  ( 48 min )
    Didn't a man invent ChatGPT?
    submitted by /u/Imagine-your-success [link] [comments]  ( 44 min )
    I made SaaS AI Tools, a collection of 400+ AI tools & daily AI news in one place.
    Hey, Over the past couple months, I've been collecting AI tools & generators and decided to put them into a website. The result is SaaS AI Tools, a growing collection of 400+ generative AI tools to help supercharge your creativity and take your business to the next level. Also, to differentiate a bit - I've added another section that involves a feed of daily AI articles, so you can keep up-to-date on the top AI headlines. This is how I'm personally keeping up with all the AI stuff today. I'll be adding more tools and news sources soon. I've launched the website on Product Hunt and would appreciate any of your support 🙏 submitted by /u/Hairy_Milk8431 [link] [comments]  ( 56 min )
    production still from 1976 of Alejandro Jodorowsky’s Spaceballs
    submitted by /u/dag [link] [comments]  ( 44 min )
  • Open

    [D] GCN datasets
    Hello everyone, ​ I have a question about GCNs and would appreciate any thoughts. Do we typically use only one graph for GCN training/inference? I'm asking this because when I saw official DGL website, there was only one example graph after loading it. Based on my experience with DNNs, I expected a batch of examples. However, it was not the case for GCNS. I could find PPI dataset with multiple graph examples (24) but for other widely used datasets (e.g., Cora, Citeseeer, and Pubmed), there was only one. Thank you! submitted by /u/ramya_1995 [link] [comments]  ( 56 min )
    [P] Looking for a CV/ML freelancer
    We are currently working looking for someone to create an app that works for images and video where the user would highlight the outline of the person in the image or video and the app would return the image and video of the person with a transparent background. The user could then go back and keep highlighting to refine the image or video of the person. If the image or video of said person is good they would just save the it on the app itself. We would want this app to be made with swift for iOS and preferably on edge. At the end just send over the project folder. Dm if you are interested. submitted by /u/bluebamboo3 [link] [comments]  ( 56 min )
    [D] Model for detecting rectangle corners?
    What model structure would be recommended for detecting the coordinates of all 4 corners of a rectangle (e.g. index cards)? Most object detection models like YOLO produce rectangular bounding boxes; what tweaks can be made to trace the object regardless of orientation? For my specific problem, classical edge/corner detectors aren't a good fit - so I'm falling back on ML. Currently have a dataset of about 1500 domain-specific labeled images; hoping to train a model on TF. Thanks for the suggestions! Edit: here are a few examples from my dataset. The green dots aren't part of the images; they just show how the corners are annotated: https://preview.redd.it/2f8uimhn7hca1.jpg?width=1373&format=pjpg&auto=webp&s=3a3757a6d3ab0f07aa3cde09f1b4acd0573f3d75 https://preview.redd.it/ujb8tmhn7hca1.jpg?width=3024&format=pjpg&auto=webp&s=e1a60b4322e3f20c10f193cb3102658975858c92 https://preview.redd.it/9lzgfmhn7hca1.jpg?width=3024&format=pjpg&auto=webp&s=a0cf4d760b48267d7c273f892284472f296f72be submitted by /u/hundley10 [link] [comments]  ( 57 min )
    [R] The Predictive Forward-Forward Algorithm
    Abstract: In this work, we propose a generalization of the forward-forward (FF) algorithm that we call the predictive forward-forward (PFF) algorithm. Specifically, we design a dynamic, recurrent neural system that learns a directed generative circuit jointly and simultaneously with a representation circuit, combining elements of predictive coding, an emerging and viable neurobiological process theory of cortical function, with the forward-forward adaptation scheme. Furthermore, PFF efficiently learns to propagate learning signals and updates synapses with forward passes only, eliminating some of the key structural and computational constraints imposed by a backpropbased scheme. Besides computational advantages, the PFF process could be further useful for understanding the learning mechanisms behind biological neurons that make use of local (and global) signals despite missing feedback connections [11]. We run several experiments on image data and demonstrate that the PFF procedure works as well as backprop, offering a promising brain-inspired algorithm for classifying, reconstructing, and synthesizing data patterns. As a result, our approach presents further evidence of the promise afforded by backprop-alternative credit assignment algorithms within the context of brain-inspired computing. Paper: https://arxiv.org/pdf/2301.01452.pdf submitted by /u/radi-cho [link] [comments]  ( 57 min )
    [D] On generated content and the future of moderation
    Over the past three years, the field of ML has advanced considerably in the field of audio, visual, and natural language generation. For users like me, GPT-3 was a first look into the types of content that can now be generated with minor effort. While impressive at first, the outputs from the original GPT-3 can quickly be seen to be less than ideal and often times can be easily distinguished from original content written by users. Three years later, generation techniques have improved to the point where the task of detecting generated content is far more difficult as the quality of the generated content has risen considerably. Access to such technologies has also spread to the point where states such as New York sees it as enough of a threat to ban it from schools. While I think we are still at the calm before the storm in regards to the potential for chaos such models have, I'd like to open the floor up for a discussion on the implications of generative models and ways we can address it. Will it even be possible to moderate content in the future when models improve to the point where artifacts from the generation process are no longer present? Sure we can have models that detect NSFW content, but what about content that contains information that is false and harmful? Perhaps a resurgence in symbolic AI and rule based reasoning is needed? Or perhaps a renewed interest in the field of argument mining? submitted by /u/sparkinflint [link] [comments]  ( 57 min )
    [D] Recommendation for best toolkit to manually annotate tunnels in 3D Volume?
    I have a 3D tiff stack which contains some holes inside the volume that branch out in various directions. I wanted to annotate the holes inside the volume and I came across several tools that I can use to manually annotate them like 3D slicer, ITK-SNAP, and ImageJ. But I am unfamiliar with all of these tools and I was wondering which one would be most helpful for me? My ultimate goal is to apply volume registration using the annotated holes as keypoints to fuse volumes together. submitted by /u/waterstrider123 [link] [comments]  ( 56 min )
    [D] Visualizations for NSFW models
    Hi all, I am looking for someone to help me for my research project. I want to use grad-CAM (or any other tool) to visualize state-of-the-art cnn predictions like those of Clarifai. submitted by /u/jeditwisted [link] [comments]  ( 54 min )
    [P] A small tool that shuts down your machine when GPU utilization drops too low.
    Hey /r/machinelearning, Long time reader, first time posting non-anonymously. I've been training models using various cloud services, but as an individual user it's stressful for me to worry about shutting down the instances if training fails or stops. Crashes, bad code, etc can cause GPU utilization to drop without the program successfully "finishing", and this idle time can cost a lot of money if you don't catch it quickly. Thus, I built this tiny lil tool to help. It watches the GPU utilization of your instance, and performs an action if it drops too low for too long. For example, shutdown the instance if GPU usage drops under 30% for 5 minutes. It's easy to use and install, just pip install gpu_sentinel If this is useful please leave comments here or on the Github page: https://github.com/moonshinelabs-ai/gpu_sentinel I'm hoping it helps save some other folks money! submitted by /u/nateharada [link] [comments]  ( 62 min )
    [D] I’m a Machine Learning Engineer for FAANG companies. What are some places I can get started doing freelance work for ML?
    I have around 6 YoE doing MLE full time work for various companies. Starting to get tired of working for these big companies and would prefer trying some freelance work. Where are some websites or places I can get started? I’ve seen UpWork, but this seemed more suited for quick one off, software work and less for complex ML tasks last time I was on there (tried that several years ago in 2019). submitted by /u/doctorjuice [link] [comments]  ( 60 min )
    [D] Fine-tuning open source models on specific tasks to compete with ChatGPT?
    As the title says, I'm curious about using open source models like GPT-J, GPT-NeoX, Bloom, or OPT to compete with ChatGPT for *specific use-cases* such as explaining what a bit of code does. ChatGPT does this task quite well, but it's closed-source nature prevents it from being useful in documenting or commenting proprietary code. There's also limitations such as the amount of text ChatGPT will read or respond with. Getting beyond these limitations is something I'm interested in pursuing, perhaps with the help of somewhere in this subreddit. Some assumptions you can safely make: We can get (lots of) funding for the training, hardware, etc... The end product should be on-premises The inference does not actually need to run very quickly. If it costs millions to buy enough GPUs just due to VRAM limitations, we could simply run on CPUs and utilize ram, as long as inference could be done a few times per day. So I guess my questions are where would we start? What model is best to fine-tune? How would you specifically fine-tune to improve specific use cases? submitted by /u/jaqws [link] [comments]  ( 56 min )
    Looking for papers to warm start a BERT large mode from BERT base. Are there papers around it?
    Warm starting the model training of BERT large using BERT base. One idea is to concatenate a bunch of parameters and start training. I was thinking is there a research paper that tries out the best methods? submitted by /u/Plane-Interaction-68 [link] [comments]  ( 52 min )
    [D] Tim Dettmers' GPU advice blog updated for 4000 series
    The legendary Tim Dettmers has updated his blog on which GPU to purchase for Deep learning to include advice for the latest GPU series: https://timdettmers.com/2023/01/16/which-gpu-for-deep-learning/ submitted by /u/init__27 [link] [comments]  ( 60 min )
    [D] The Illustrated Stable Diffusion (Video)
    I'll be honest with you, it took me months to wrap my head around diffusion models. A couple of iterations of a blog post later and this is my best shot at a gentle intro to Stable Diffusion and how it works. https://youtu.be/MXmacOUJUaw The part that took the most reworking is forward diffusion and how to best describe it. Thanks to the many people acknowledged in the blog post who have helped me both understand it and explain it better. Hope you find it helpful. Let me know if you have any questions or feedback. submitted by /u/jayalammar [link] [comments]  ( 57 min )
    [D] Can ChatGPT flag it's own writings?
    My question is, if it is possible to feed a direct quote into ChatGPT and ask it if ChatGPT is the author of said quote? If not, is it reasonable to insist that it can do so in the future? submitted by /u/MrSpotgold [link] [comments]  ( 69 min )
    [R] [2301.00250] DensePose From WiFi
    submitted by /u/GreatCosmicMoustache [link] [comments]  ( 50 min )
    [D] Grid searching data pre processing permutations when training models on structured data.
    Hello, I am currently working on structured data classification problem for work. I was applying multiple different data pre processing steps including imputing null values (mean, KNN, random forrest), adding synthetic data (SMOTE, ADAYSN or None), normalization (l1, l2, max or none), multiple datasets (including different sets of features), as well as different models (XGBoost, Random Forrest, Logistic Regression, KNN, MLP). What I built was a tool that trains all the different permutations of data processing, datasets and models to find the best one, and applied K-Fold cross validation. The tool stores all the data and metrics using MLFlow. This is similar to a grid search across hyperparemeters, but instead of tuning the hyper parameters, I am tuning the data processing steps. I like this method because I gain a level of confidence knowing that I have exhausted all the possible models, data, and pre processing permutations when selecting the best performing model. I was wondering if other people apply a similar technique for structured data problems? Besides the compute is there anything to be cautious of when applying this method? submitted by /u/spiritualquestions [link] [comments]  ( 57 min )
    [D] SOTA on multiple image generation from text
    Wondering what the state of the art is for multiple image generation for an input text, or a series of input texts. To clarify, are there any models or architectures that explore consistency between image generation. (E.g stylistically, same people in the images, same settings, etc) I imagine there would be some pre-existing architectures that could take an image embedding along with a text embedding submitted by /u/weelamb [link] [comments]  ( 56 min )
    [P] Nano GPT
    submitted by /u/trekhleb [link] [comments]  ( 56 min )
  • Open

    Pretraining quadrupeds: a case study in RL as an engineering tool
    submitted by /u/robotphilanthropist [link] [comments]  ( 54 min )
    Is there a publicly available state space model for the Lunar Lander environment?
    The Lunar Lander environment uses the box2d engine to simulate physics. I was wondering if there is code somewhere which explicitly models the environment as as state-space model? LunarLander code: https://github.com/openai/gym/blob/master/gym/envs/box2d/lunar_lander.py submitted by /u/HazrMard [link] [comments]  ( 57 min )
    Poker (NLH) model?
    Is there any open source model for online poker yet? Of course Pluribus was a big deal a few years ago but it’s closed source (and much has changed since), but with the recent OS Rocket League AI stomping pros I have to wonder why nothing has come to the surface with poker yet. Even a 5% improvement on human play would be a big deal in the long run. Is poker that hard? Or is there some model I’m unaware of? Thanks submitted by /u/enterguild [link] [comments]  ( 54 min )
    Need help creating an action space compatible with stable baselines
    I'm trying to train a bot to play a game and am having trouble creating an action space to handle the inputs, which are the wasd keys, space bar for jumping, left click for shooting, and also two continuous values to indicate the coordinates the mouse should move to. At first, I tried to use spaces.Tuple to combine a MultiDiscrete space for the key presses, and a Box space for the mouse movement. However, I quickly found that none of the stable baseline models support tuples. So I looked online and found an idea to change all of my discrete values to continuous values and round to the nearest integer. This sounded promising, so I created an action space like so: # Game window bounds to provide range mouse can move windowX = self._game_window_bounds[0] windowY = self._game_window_bounds[…  ( 63 min )
    SKRL (reinforcement learning library) version 0.9.0 is now available!
    skrl-v0.9.0 is now available! skrl is an open-source modular library for Reinforcement Learning written in Python (using PyTorch) and designed with a focus on readability, simplicity, and transparency of algorithm implementation. In addition to supporting the OpenAI Gym / Farama Gymnasium, DeepMind, and other environment interfaces, it allows loading and configuring NVIDIA Isaac Gym and NVIDIA Omniverse Isaac Gym environments, enabling agents’ simultaneous training by scopes (subsets of environments among all available environments), which may or may not share resources, in the same run. Visit https://skrl.readthedocs.io to get started!! ​ The major changes in this release are: Added Support for Farama Gymnasium interface Wrapper for robosuite environments Weights & Biases integration Set the running mode (training or evaluation) of the agents Allow clipping of the gradient norm for DDPG, TD3, and SAC agents Initialize model biases Add RNN (RNN, LSTM, GRU, and any other variant) support for A2C, DDPG, PPO, SAC, TD3, and TRPO agents Allow disabling training/evaluation progressbar Farama Shimmy and robosuite examples KUKA LBR iiwa real-world example More benchmarking results Changed Forward model inputs as a Python dictionary [breaking change] Returns a Python dictionary with extra output values in model calls [breaking change] Adopt the implementation of terminated and truncated over done for all environments Fixed Omniverse Isaac Gym simulation speed for the Franka Emika real-world example Call agents' method record_transition instead of the parent method to allow storing samples in memories during the evaluation Move TRPO policy optimization out of the value optimization loop Access to the categorical model distribution Call reset only once for Gym/Gymnasium vectorized environments Removed Deprecated method start in trainers submitted by /u/Toni-SM [link] [comments]  ( 60 min )
    Hyperparameters for pick&place with Franka Emika manipulator
    I'm trying to solve pick&place (and possibly also the other tasks in this repository) with Franka Emika Panda manipulator simulated in Mujoco. I've tried for long with stable_baseline3 but without any results, someone told me to try with RLLib because has better implementation (?), but still I can't find any solution... submitted by /u/riccardogauss [link] [comments]  ( 51 min )
    Best Books to Learn Reinforcement Learning
    submitted by /u/Lakshmireddys [link] [comments]  ( 53 min )
    I'm understanding theory; hard time figuring out how to implement it
    Currently, I'm following David Silver's course along with Sutton and Barto's Introduction to Reinforcement Learning. While these are both fantastic I'm having a hard time thinking of how I can actually implement them in code; mainly getting the environment and agent to be connected. Any help would be appreciated. ​ EDIT: In general I'm also interested in how exactly models stay trained in an environment as I would imagine the program would have to run continuously or else it would have to relearn the task every time. submitted by /u/CaptiDoor [link] [comments]  ( 56 min )
    Question about designing the reward function
    Hi all, I am struggling to design a reward function for the following system: It has two joints, q1 and q2 that can not be actuated at the same time. Once q1 is actuated, the system has to wait for 5 seconds to activate q2. The task is to reach a goal position x and y with the system by interchangeably using q1 and q2. So far the reward function looks like this: reward = 1/(1+pos_error) And the observation vector like this: obs = (dof_pos, goal_pos, pos_error) To make the robot interchangeably use q1 and q2, I use two masks: q1_mask = (1, 0) and q2_mask= (0,1) that are interchangeably used to only actuate one joint at the same time. But I am not sure how to implement the second condition that the system needs 5 seconds to activate q2 after q1. So far I am just storing the time that q1 has been activated and replace the actions by 0: self.actions = torch.where( (self.q2_activation > 0) & (self.q2_activation_time_diff > 5) , self.actions * q2_mask, self.actions ) I think the agent gets irritated by simply as nothing as changed by the actions. How would approach for this problem? submitted by /u/Fun-Moose-3841 [link] [comments]  ( 59 min )
  • Open

    It’s No Big Deal, but ChatGPT Changes Everything – Part I
    Time to catch the ChatGPT craze!  Yes, everyone is flocking to the ChatGPT AI-driven chatbot and asking all sorts of life altering questions such as instructions for removing a peanut butter sandwich from a VCR in Biblical verse in Figure 1 (my answer…you’ve still got a VCR?). Figure 1: ChatGPT in Action! Heck, Ryan Reynolds… Read More »It’s No Big Deal, but ChatGPT Changes Everything – Part I The post It’s No Big Deal, but ChatGPT Changes Everything – Part I appeared first on Data Science Central.  ( 23 min )
  • Open

    Hello! Has anyone tried to create a neural network to find all PI numbers?
    submitted by /u/ScIentIaEstP0tentIa [link] [comments]  ( 58 min )
    Reverse Engineering a Neural Network's Clever Solution to Binary Addition
    submitted by /u/nickb [link] [comments]  ( 55 min )
    🚀Muse:Text 2 Image Generation via Masked Generative Transformers
    submitted by /u/oridnary_artist [link] [comments]  ( 55 min )
    Does the "optimal structure" depend on the size of the sample or just the complexity of the problem?
    Hi everyone. First post in reddit in general. Let me start by saying that I am more or less new to neural networks and self-taught in the subject. I am learning Neural Network as part of my PhD in energy engineering. Basically, I deal with optimisation of energy systems (more precisely, Concentrated Solar Power plants) and I want to use neural networks to create a surrogate model of certain parts of my detailed and time-consuming models to later apply optimisation. So, if I am not wrong, I am dealing with a very classic "function approximation" problem. I want to train a neural network for an specific application. To do so, I gathered a large data set for my detailed model and then trained multiple networks of different number of neurons (considering only one hidden layers right now). As a result, I obtained an optimum number of neurons which is the smallest to achieve a certain error (measured through the RMSE). Now my question: imagine you gather a new data set from the same model, but smaller. Could you assume that the optimum structure (number of neurons) is the same? I acknowledge that if the size of the new data set is too small there could be problems of overfitting but, if you can assume that the new data set, although smaller, is still statistically representative of the problem, wouldn't the optimum structure be the same as the optimum structure is just related to the complexity of the problem? Hope I was clear enough, probably my question is very simple. Thanks! submitted by /u/paworod [link] [comments]  ( 55 min )
  • Open

    Zeta sum vs zeta product
    The Riemann zeta function ζ(s) is given by an infinite sum and an infinite product for complex numbers s with real part greater than 1 [*]. The infinite sum is equal to the infinite product, but which would give you more accuracy: N terms of the sum or N terms of the product? We’ll take […] Zeta sum vs zeta product first appeared on John D. Cook.  ( 5 min )
    Approximating pi with Bernoulli numbers
    In a paper on arXiv Simon Plouffe gives the formula which he derives from an equation in Abramowitz and Stegun (A&S). It took a little while for me to understand what Plouffe intended. I don’t mean my remarks here to be criticism of the author but rather helpful hints for anyone else who might have […] Approximating pi with Bernoulli numbers first appeared on John D. Cook.  ( 5 min )

  • Open

    Emma Myers as an Marvel whatif character
    submitted by /u/oridnary_artist [link] [comments]  ( 43 min )
    I find the timing of the ChatGPT release curious.
    Prior to the ChatGPT release, there were several powerful & similar AI systems already in existence .. but little publicised. I wonder if the ChatGPT release was a commercial ploy to beat the (possibly more capable) Google AI systems such as PALM to the market? Or maybe the release was intended to break the wall of secrecy hiding AI systems from public view? Whatever was behind the release of ChatGPT, at least it may force Google to finally give the public access to their systems too. submitted by /u/MrEloi [link] [comments]  ( 49 min )
    Google may use Deepmind's Sparrow as ChatGPT competitor
    submitted by /u/henlo_there_fren [link] [comments]  ( 46 min )
    how would i effectively train an ai on minecraft building?
    i want to make an ai that you can put a prompt into and it makes a minecraft schematic. this sounds easy to me until you really get into the specifics of it. i could train it on a bunch of schematics with their relative names but there wouldnt really be any sense to it. For example- if you train it on a bunch of dragons and a bunch of sunglasses, and then tell it to build a dragon with sunglasses, it wouldnt know "where" to put the glasses relative to the dragon. ​ Whats the best way to go about this? submitted by /u/cbreauxgaming [link] [comments]  ( 49 min )
    So far I have found this tool (chatGPT) amazing in helping me write code, however...
    submitted by /u/oh_you_so_bad_6-6-6 [link] [comments]  ( 45 min )
    Pruning and Quantizing YOLO V7 With Modoptima
    In this blog, I have explained how you can use my library modoptima to optimize the YOLO v7 model improving the inference speed by 3-10 times on good processors. https://medium.com/@vikasojha894/pruning-and-quantizing-yolo-v7-with-modoptima-19c61aff7301 submitted by /u/VikasOjha666 [link] [comments]  ( 48 min )
    YouTube channel on AI architecture concepts?
    Ive got a Masters degree in AI and I've really been enjoying a channel called 2 minute papers, but it's a bit too much focused on results and doesn't go at all into the key insights of each paper. I'm not looking for tutorials or specific implementation details - I'm interested in the architectures and types of neural networks which were used to obtain these amazing results. Does anybody know a YouTube channel (or something similar) which goes briefly into the core technology of each paper? submitted by /u/BagelOrb [link] [comments]  ( 44 min )
    Could Deepmind’s Sparrow be Google’s answer to ChatGPT?
    submitted by /u/liquidocelotYT [link] [comments]  ( 46 min )
    Artificial Intelligence Best Paper Awards Reviewed by Computer Vision News (and much more)
    Dear all, Here is Computer Vision News of January 2023. It includes reviews of 2 Best Paper Award winning research papers. Read 44 pages about AI, Deep Learning, Computer Vision and more - with code! Read online version for free (recommended) PDF version Free subscription on page 44. Enjoy! https://preview.redd.it/1wgoydszp7ca1.jpg?width=400&format=pjpg&auto=webp&s=1303f337dd627a6f9252d3a341c005e3cb06f433 submitted by /u/Gletta [link] [comments]  ( 44 min )
    Open art space
    Hey Reddit, I am new here. I was wondering if anyone knows if there is a art place in London where we can paint or draw for free? submitted by /u/lucasagazzani [link] [comments]  ( 44 min )
    Should I tell my company about Chat GPT to implement it into our workflow or keep silent about it?
    I work for a public institution in a small country in Europe. I figure it’ll take some time until people will hear about it or even try to implement it. I think it’s inevitable though that some day it’ll rule our job market. I’m currently using AI for my own soon to be start-up and also for mundane tasks. My question is: Should I tell the board of my company about it? Since I’m on a strategic position of some sorts I think it’ll also be great for my reputation to be the first to kickstart it. However, I’m afraid that by opening this pandora’s box I’ll create more competition by showing other people how to use it. I don’t know, what would you do? submitted by /u/Darklan [link] [comments]  ( 56 min )
    What will BMW M and Mercedes AMG Cars Look Like in the Future?
    submitted by /u/BallbustCuck [link] [comments]  ( 44 min )
    Bavaria-based mobility company German Bionic has developed an AI-powered exoskeleton that's designed to help workers carry out physically demanding jobs
    submitted by /u/Rollyman1 [link] [comments]  ( 44 min )
    Excuse me? How is airsoft not following their policy?
    submitted by /u/vajenny_zlacyniec [link] [comments]  ( 44 min )
    CHATGPT D&D Graphic Novel with Dalle-e and Azure Voice-Over
    submitted by /u/erikmalkavian [link] [comments]  ( 53 min )
    Inpainting with the Visuali editor (beta)
    submitted by /u/aigeneration [link] [comments]  ( 46 min )
    AI for your own files
    Hi there, I was wondering if there is some sort of tool out there that allows you to have some sort of localised AI - as in it searches all my files, but also within them. Say theres a poerpoint file and one of the slides has some relative content, it will show that slide and not just the file itself. I run a consultancy and have 1000s of documents from past campaigns and clients, it would be great to be able to teach some form of AI about my content and then it finds things or creates things when I need it.... thanks! submitted by /u/Vincenth2008 [link] [comments]  ( 47 min )
    How ChatGPT would have changed my life
    submitted by /u/DeeMore [link] [comments]  ( 61 min )
    Box2Mask: A Unique Method for Single-Shot Instance Segmentation that Combines Deep Learning with the Level-Set Evolution Model to Provide Accurate Mask Predictions with only Bounding Box Supervision
    submitted by /u/ai-lover [link] [comments]  ( 48 min )
    AI-Developed, Synthetic DNA is About to Revolutionize Drug Production and Gene Therapy
    submitted by /u/digitalgoldnow [link] [comments]  ( 43 min )
    Build a simply GPT-3 chatbot in Python in 20 lines of code in 5 minutes
    submitted by /u/techie_ray [link] [comments]  ( 43 min )
    AI text to art generation explained simply with pen and paper
    submitted by /u/techie_ray [link] [comments]  ( 46 min )
    Unpopular opinion: AI will make jobs more boring
    One claim I hear a lot in AI circles is that in the future AI will replace a lot of bullshit jobs, which will free up time for humans to focus on what matters. No more transcribing emails for a living. New, meaningful jobs will be created instead. In other words, the hope is that in 10 years more people will be psychologically happy with their jobs than today. I'm growing wary of the opposite risk. If we consider "bullshit" the part of a job that can be automated away, then sometimes the bullshit part of a job is what makes it pleasurable and fulfilling (provided it is not the only thing). As an artist you may draw pleasure in doing coloring after a pencil sketch. As an engineer, you may like a day without planning or meetings where you can focus on programming a small piece of code. As a blog post writer you may enjoy, well, writing. AI are quickly becoming able to replace all these tasks, and to stay competitive people will increasingly be required to offload parts of their jobs to them. I can see a future where jobs in any field start looking the same - the person is in charge of having higher knowledge about the problem, planning tasks for AI, be able to evaluate AI output, and assemble final product. This is certainly not a bullshit job, but I also think that for many people this is not going to be a fulfilling job. submitted by /u/R_y_n_o [link] [comments]  ( 46 min )
    Our horror game story is crafted with ChatGPT and this scene is part of it. What do you think about it?
    submitted by /u/Leaderide [link] [comments]  ( 45 min )
    About art AIs, how noise works?
    I have heard that art AI converts images into "noise" to create their models. Question: Can you revert the "noise" into the image it was based into? E.G I take the mona lisa, convert it into "noise" so the AI can understand it. Can I then request the AI to use the mona lisa "noise" to generate the mona lisa back again? Or there is no way to return it back to the image after it is converted into noise? (not the exact image, but its equivalent based into the noise stored). submitted by /u/___Marshmallow___ [link] [comments]  ( 45 min )
  • Open

    Emma Myers as an Marvel whatif character
    submitted by /u/oridnary_artist [link] [comments]  ( 53 min )
    Are there any DNNs that perform gradient descent in runtime (mesa alignment)?
    I was watching the Robert Miles series on AI alignment, specifically the one about mesa optimizers, and when he started talking about the mesa objective I noticed that the way in which he defined the base objective (gradient descent + loss function + train data) it's not analogous to the way in which the mesa objective would work (as the latter cannot change any parameters of the network, as no Gradient Descent step is used or anything analogous). This got me thinking if are there any papers that implement runtime parameter change, through gradient descent or not. submitted by /u/not-alredy-taken [link] [comments]  ( 55 min )
    Electronic circuits analogies with Deep Neural Networks
    I was wondering these days about the training parameter in tensor flow layers, and stumbled upon the idea of letting the activation function itself be designed in such a way that it’s backward pass yields no gradients, independent (or almost) of the loss in use. Do you know any? Then it came to me that this resembles a buffer op amp, and I was wondering if are there out there any papers that explore circuit analogies with the training process, treating the inputs and gradients as “currents”. Seems like an interesting concept ! submitted by /u/not-alredy-taken [link] [comments]  ( 58 min )
  • Open

    [D] What kinds of interesting models can I train with just an RTX 4080?
    I'm aware transformers are pretty vram hungry and a 4080 only has 16 GB. So I am guessing a lot of transformer based models will be out of the question. At least anything that is interesting. Not sure about other models though. Is there anything I can do with a 4080 that's beyond just some toy experiment? submitted by /u/faker10101891 [link] [comments]  ( 55 min )
    [D] What is standard practice in RL when reporting average returns across multiple seeds in a table or a plot?
    Hey everyone, This may be a silly question but I'm confused as to what standard practice is when reporting average returns across multiple seeds in a table or a plot. It's usually not even mentioned but I sometimes see authors mention they are using: Average ± Standard Deviation Average ± Standard Error Average ± 1.96 * Standard Error Bootstrapped CIs For example, this paper (https://www.jmlr.org/papers/volume23/21-1342/21-1342.pdf) by the authors of clearnrl doesn't specify anything other than that the "reported numbers are the final average episodic returns of at least 3 random seeds". What would you consider best practice in RL? submitted by /u/thekingpenguin3 [link] [comments]  ( 57 min )
    [P] Parameter optimization on a guitar amp emulation
    So I'm working on project which goal is to digitally emulate guitar tube amplifiers using a Wiener-Hammerstein model. For those of you unfamiliar with this type of model, its key block is a nonlinear block that is characterized by a set of 8 parameters. Basically there's a raw input guitar signal and there's an output signal that should be as close as possible to the actual output of the modeled amplifier. I have a database with a series of magnitude-variable chirp signals serving as inputs and the respective output measurements. So my question is what is the best way for me to automate the process of optimization of this set parameters. I thought of using a genetic algorithm but I wondered if that's the most accurate and efficient way of doing it. Also this is to be implement on a microcontroller so I have more limited resources than a computer. However, it would be really cool to be able to customize these parameters in real time on my Teensy 4.0 so it would be ideal that the algorithm could meet this condition, although it's not completely necessary submitted by /u/syko101 [link] [comments]  ( 57 min )
    [P] Summaries of Ten Interesting & Influential Papers I read in 2022
    submitted by /u/seraschka [link] [comments]  ( 91 min )
    [Discussion] can't lives down on won't street
    submitted by /u/Psqwanio [link] [comments]  ( 53 min )
    [D] AI Security - Gumroad AI help bot refuses to answer certain questions... unless you PRIME it with a question that it WILL answer first
    submitted by /u/LatentWeb [link] [comments]  ( 53 min )
    [P] Modified kmeans algorithm returns the wrong answer
    I am trying to create a kmeans algorithm that is based on the Earth Movers Distance instead of the Euclidean distance. However, when I run it, it just returns the same value for all data points. ​ The input is an dxn matrix containing all of my n probability distributions. ​ Here is an example of running the algorithm. The clusters should be much more distributed. ​ distribution = {} num_bins = 5 for i in data: distribution[i] = np.histogram(data[i], bins = num_bins)[0] / len(data[i]) ​ Z = np.zeros((len(data), num_bins)) for i in range(len(Z)): Z[i] = distribution[list(distribution)[i]] Z = Z.T ans = k_means_algorithm(Z, 8, proportionally_random_k) ​ res = points_to_clusters(Z, ans) print(res) ​ [[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 1. 1. 1. 1. 1. …  ( 58 min )
    [R] hlb-CIFAR10 0.2.0: New world record for single-GPU CIFAR10, ~<12.38s with one A100 (SXM4, Colab)
    submitted by /u/tysam_and_co [link] [comments]  ( 68 min )
    Best Predictive Model to Predict Total Monthly Stock Returns on Panel Data [P]
    Hello Reddit community, I am faced with some Finance coursework and I would appreciate any help or guidance from experienced practitioners in the ML/Finance industry. I have a panel dataset (Data File) - that includes information such as date, stock name, market-cap, sector, and a column for the target variable which is the monthly absolute return of the stock in percentage terms (i.e one month from the date column). I am tasked with building a predictive model to forecast the target variable. I would like to know information on which ML model you would recommend, and why? Thank you in advance for any help or guidance provided. submitted by /u/RhiteousRhino [link] [comments]  ( 56 min )
    [P] Free PyTorch Deep Learning class, from Perceptrons to multi-GPU training and cloud deployment
    submitted by /u/seraschka [link] [comments]  ( 55 min )
    [D] Problem with predict/evaluate and batch_size keras
    So, i trained my unet using keras. Best dicescore of saved model supposed to be 0.817. Then i ran a hand-made score prediction: batch_size = 1 n_val_img = len(os.listdir(os.path.join(fp2,"sujetos"))) vspe = n_val_img//batch_size dice = 0 for _ in range(0,vspe): test_image_batch, test_mask_batch = val_gen_ds.__next__() for i in range(test_image_batch.shape[0]): a = my_unet(np.expand_dims(test_image_batch[i], 0)).numpy() predicted_img_th = (a[0,:,:,0]>0.5)*1 dice += Dice(test_mask_batch[i],predicted_img_th).numpy() print(dice/n_val_img) This return around 0.836, so... already different. Then i tried to replicate my evaluate score with different batch sizes: my_unet.evaluate(val_gen_ds,batch_size=batch_size,steps=vspe) ​ https://preview.redd.it/pmt5b358e8ca1.png?width=567&format=png&auto=webp&s=bbd24505bdb55b66a3cab12007cc8d628e533907 Clearly, it doesn't make sense to me... What's wrong? what am i missing? submitted by /u/SerDetestable [link] [comments]  ( 58 min )
    [R] HYPERREAL — high fidelity 6dof video with ray-conditioned sampling
    submitted by /u/SpatialComputing [link] [comments]  ( 55 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 52 min )
    [N] Will w3 see lower costs and faster training? Will Floating Point 8 Solve AI/ML Overhead?
    submitted by /u/RuairiSpain [link] [comments]  ( 53 min )
    [D] Time Embedding in Diffusion Model
    I was looking at how time is embedded in diffusion models, and I found these two implementations [1] and [2]: The first one is a simplified version of the second one, but the idea behind the time embedding is similar. What I've understood is that t is a number, it goes in a SinusoidalPositionEmbeddings with a given time_dim, then Linear + ReLU where the same dimensions are kept. Then for each down-step of the UNet, an additional Linear + ReLU is performed to match the channels of the image embedding, and this latter embedding is added to the output of the CNN. Here when have the time embedding with a shape of (b, c, 1, 1) and the image embedding with a shape of (b, c, h, w). When we perform the addition, the time embedding is broadcasted to match the image embedding. As far as I understand, here the latent space of the image gets reweighted channel-wise, but the same weights are added for each different position. Why they did follow this choice? This is quite different from the standard positional encoding used e.g. in self-attention, where the positional embedding gives a different weight to each spatial dimension. I never found this detail explained in any Diffusion Model paper/tutorial, also if we look at [2], the same idea is made more complex, with more Linear projections and different activation functions (GeLU and SiLU). Moreover, I'm not sure about the difference between applying a time embedding and then directly a conv2d layer against the time embedding + attention + conv2d. Aren't these types of embedding suited up for attention layers? How does a Conv2D layer, which is positional invariant by construction, benefit from this type of operation? [1] https://colab.research.google.com/drive/1sjy9odlSSy0RBVgMTgP7s99NXsqglsUL?usp=sharing#scrollTo=KOYPSxPf_LL7 [2] https://github.com/lucidrains/denoising-diffusion-pytorch/blob/main/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py submitted by /u/Lumett [link] [comments]  ( 60 min )
    [P] I built an app that allows you to build Image Classifiers completely on your phone. Collect data, Train models, and Preview the predictions in realtime. You can also export the model/dataset to be used anywhere else. Would love some feedback.
    submitted by /u/Playgroundai [link] [comments]  ( 63 min )
    [P] I built arxiv-summary.com, a list of GPT-3 generated paper summaries
    Hi there, I wanted to share my new project with you, it is called arxiv-summary.com. Right now, I find it really difficult to keep up with all the important new publications in our field. Especially, it is sometimes difficult to get an overview of a paper to decide if it's worth reading. I really like arxiv-sanity by Andrej Karpathy, but even with that, it can still take some time to understand the main ideas and contributions from the abstract. With arxiv-summary, my goal is to make ML research papers more "human-parsable". The website works by fetching new papers daily from arxiv.org, using PapersWithCode to filter out the most relevant ones. Then, I parse the papers' pdf and LaTeX source code to extract relevant sections and subsections. GPT-3 then summarizes each section and subsection as bullet points, which are finally compiled into a blog post and uploaded to the site. You can check out the site at arxiv-summary.com and see for yourself. There's also a search page and an archive page where you can get a chronological overview. If you have any feedback or questions, I'd be happy to hear them. Also, if you work at OpenAI and could gift me some more tokens, that would be much appreciated :D Thanks and happy reading! submitted by /u/niclas_wue [link] [comments]  ( 65 min )
    [Project] Introducing Visionner (Your image dataset toolkit)
    submitted by /u/charles_data_dev [link] [comments]  ( 58 min )
    [P] C++ wrapper around libsvm and liblinear using Eigen
    I needed a C++ wrapper library around libsvm and liblinear using Eigen so I made one. Maybe it's useful for you as well: https://github.com/bloomen/svmegn submitted by /u/cblume [link] [comments]  ( 56 min )
    [D] Packing multiple shorter training examples in to single sequence in LM pretraining
    I've come across several papers where authors mention they, for reasons of computational efficiency, will pack multiple shorter training examples together in a single sequence. Example from "Scaling Instruction-Finetuned Language Models" (Chung et al, 2022): We use packing (Raffel et al., 2020) to combine multiple training examples into a single sequence, separating inputs from targets using an end-of-sequence token. Masking is applied to prevent the tokens from attending to others across the packed example boundary. I'm curious to understand how this is actually done in practice. Seeing as multiple separate masks are involved, I'd think one would need to loop over them all and repeat (?) the matrix multiplication several times? Is there some built in functionality in Pytorch and other frameworks to deal with a situation like this with multiple masks? Thankful if someone could share and explain, or link to an implementation of input packing. submitted by /u/mLalush [link] [comments]  ( 59 min )
  • Open

    Online Reinforcement Learning Courses
    Can anyone recommend any online Reinforcement Learning courses, preferably those that have assignments or exercises that are graded so there is some way to check your answers? submitted by /u/Smart-Ground-3587 [link] [comments]  ( 55 min )
    CS234 Stanford
    Can you submit the programming assignments of CS234 without using gradescope? How do you check your work if you are taking the course without being enrolled in it officially? submitted by /u/Smart-Ground-3587 [link] [comments]  ( 55 min )
    Boardgame environment and data sources [help]
    Hi everybody! I'm trying to come up with a fun learning project to execute. I'd like to try to create an agent that plays a boardgame. My plan is to apply reinforcement learning and behavioral cloning. I don't know what boardgame to work with yet, I just want it to be simple enough to work with and fulfill two requirements: The environment must be implemented and accessible. I must have access to a dataset containing historical plays. Would anybody have suggestions on where I could find these? Any responses are appreciated. Thank you! submitted by /u/valahart [link] [comments]  ( 55 min )
    In Asynchronous n-step DQN, is there a global shared gradient vector or gradient vector for each thread?
    In this paper: *Asynchronous Methods for Deep Reinforcement Learning (arxiv.org) This is the pseudocode for n-step DQN: https://preview.redd.it/rjt84n6zs9ca1.png?width=1176&format=png&auto=webp&s=eda075c5741bae96f954432073b0d6617937941a In the pseudocode above it mentions: "Initialize network gradients dtheta <-- 0." Is this referring to a global shared gradient vector or a gradient vector for each thread? I noticed that they use theta instead of theta' making me think it is a global shared gradient vector. But if this is the case, couldn't a thread clear the gradient vector while another thread is accumulating gradients? Also, in section 7 of the paper, where they talk about implementing SGD with Momentum in an Asynchronous setting it seems to imply that it is a global shared gradient vector normally. https://preview.redd.it/souf3gxkt9ca1.png?width=1152&format=png&auto=webp&s=4a0cff8ba6100c651d0ed8b7d34089ec2e3aecd5 submitted by /u/ImNotKevPlayz [link] [comments]  ( 56 min )
    What is data governance and its importance to a company?
    submitted by /u/ranjeettechnincal [link] [comments]  ( 54 min )
    How to interpet negative entropy_loss that keeps decreasing in PPO?
    Why is it negative? Why it keeps decreasing? Will it stop at any point? Is this expecred behavior? If not what should I adjust? I am using Stable Baselines3 ​ https://preview.redd.it/bypjlwy5j9ca1.png?width=372&format=png&auto=webp&s=94fb5fe55b66c90d705c6776ffa5dd258e90d2ba https://preview.redd.it/f0ooky84j9ca1.png?width=1413&format=png&auto=webp&s=0b62f9d8df555835ce72cfe41cfd840c295b70ca submitted by /u/andrew7777777 [link] [comments]  ( 54 min )
    Help with transitioning an existing DQN into a DRQN
    Hi RL reddit, To preface this post, please let me know if I need to clarify any details to receive help and/or guidance. I am new to posting on this subreddit and still consider myself a novice in the deep RL domain. What I need help with is transitioning an existing DQN into a DRQN. The DQN architecture and the environment that it learns comes directly from this paper https://arxiv.org/pdf/1810.04244.pdf To briefly summarize the paper, the author proposes a DQN network as a controller to guide fixed winged aircrafts to follow the evolution of a spreading wildfire (grid environment). The same DQN can be used for the both aircrafts. The inputs are follows: A 5 dimensional vector bank angle of ownship distance to other aircraft bearing angle to other aircraft relative to current he…  ( 66 min )
    Best practices for Self-Play RL
    Hi! I know there is a lot of work on self-play (training RL agents in environments where they play against themselves), and I've found several tricks to stabilize the training process. I was wondering if someone who has experience in this field could provide a compilation of such tricks and best practices, for example: Fictitious self-play: Keeping N previous checkpoints of the agent and sampling from a pool of these to select an opponent every T environment steps. (I have also seen people sample a new opponent after every single env.reset() call, what do you think is best?). What is a reasonable value for T? and for N? KL distillation loss: Adding a KL loss penalty between the current agent being trained and the last checkpoint stored so that the policy doesn't change abruptly. How is this usually implemented? What's a reasonable value for a coefficient for that penalty? Imagine a DQN agent playing against itself, is it reasonable to set epsilon=1 and start annealing it every time a new enemy is sampled? (in case we play against the same enemy for a long time). ​ There might be many more tricks so if we can list them all here that'd be great! Thank you all! submitted by /u/xWh0am1 [link] [comments]  ( 61 min )
  • Open

    Reverse engineering options
    This weekend I saw a sign in the window of a Burger King™ that made me think of an interesting problem. If you know the number of possibilities like this, how would you reverse engineer what the options that created the possibilities? In the example above, there are 211,184 = 213×33 possible answers, and so […] Reverse engineering options first appeared on John D. Cook.  ( 6 min )

  • Open

    Interesting website for people struggling with the concept of Roko's basilisk
    submitted by /u/Fusionism [link] [comments]  ( 45 min )
    Best Artificial Intelligence books for beginners to Experts to read
    submitted by /u/Lakshmireddys [link] [comments]  ( 44 min )
    Is there a tool (or research paper) for training a model on images for inpainting?
    I want to train a ML model on an object and have it be able to inpaint the specific object into other photos. Does a service like this exist? submitted by /u/Zestybeef10 [link] [comments]  ( 45 min )
    AI Sales Chat Bot and Telephone Agent
    Hi there, I'm looking for AI software that can do the following tasks: A fully automated chat bot that can be used to close deals. A telephone bot that can do outreach and also sense if a lead is interested or not. Basically I want to be able to automate the whole outreach + close process by AI so that I can process much more contact data as human sales reps could ever do. If a lead is interested, they should receive a link with an offer where they can directly purchase the digital product that I'm planning to sell. submitted by /u/cokedinosaur [link] [comments]  ( 44 min )
    AI Etsy shop
    Hey guys , I got into the ai space about 3 weeks ago and am stating my Esty shop journey with mainly air created photos of supercars. I would just love some feedback as to what I can add and how to make it as best as possible. I am planning on bringing in large metal canvas soon. Just really wanting feedback on the art work. Thank you!! https://picturetron.etsy.com submitted by /u/BetterPresentation35 [link] [comments]  ( 45 min )
    What is the future of NLP for the coming 24 months? Dall-E clones Mid-Journey and SD took 6-8 months to appear, so is that how long it will take for clones of ChatGPT? Perhaps less time given the higher investment and market potential?
    submitted by /u/MegavirusOfDoom [link] [comments]  ( 50 min )
    Spray-on smart skin uses AI to rapidly understand hand tasks
    submitted by /u/qptbook [link] [comments]  ( 44 min )
    Which A.I. Labs are mostly likely to have offshoots that scale well?
    With ChatGPT and GPT-4 hype on full throttle I anticipate a lot of new A.I. labs and off-shoots of OpenAI and Google Brain and DeepMind will keep forming. Speaking of which, Niki Parmar and Ashish Vaswani, two prominent artificial intelligence researchers who left Google in 2021 to launch Adept , have now departed to make yet another A.I. lab in stealth mode. By 2024, there will be around a dozen good A.I. labs not named OpenAI. Google itself has PaLM with RLHF, Chinchilla, Google Duplex, LaMDA and Sparrow (DeepMind). More foundational models will arrive as A.I. labs make their models public. But who and which one is likely to be good? Anthropic, Adept, Inflection A.I. AI21 Labs, Cohere, there are a lot of potential candidates. This is not counting the ones likely forming in China, a…  ( 49 min )
    Could AI be the answer to our content needs by 2025?
    submitted by /u/Realistic-Plant3957 [link] [comments]  ( 44 min )
    I Created A Reddit Chatbot..(Potentially offensive content)
    submitted by /u/TheRPGGamerMan [link] [comments]  ( 48 min )
    What practical applications have you already found for ChatGPT?
    submitted by /u/DrMelbourne [link] [comments]  ( 46 min )
    ChatGPT will undoubtedly change the world. The question is HOW? What are your thoughts?
    submitted by /u/DrMelbourne [link] [comments]  ( 46 min )
    AI that you give an image to and it says colors that can compliment it?
    Title says it all, is there an AI that you can insert in an image and it says some colors that could work with it? submitted by /u/NewShibeAccount [link] [comments]  ( 46 min )
    Do you want to understand how an end-2-end paraphrase app can be created?
    Check out this medium article about how to create such an app: https://medium.com/towards-artificial-intelligence/how-to-create-an-end-2-end-text-paraphrase-app-db83a4e05918 Or check this repository about Quotera a paraphrase app to be deployed via Streamlit or FastAPI and Docker in Python: https://github.com/stavrostheocharis/quotera submitted by /u/Nice-Tomorrow2926 [link] [comments]  ( 45 min )
    THIS BOOK WAS WRITTEN ENTIRELY BY CHATGPT.
    submitted by /u/__sandeepan__ [link] [comments]  ( 43 min )
    Bagging vs Boosting Explained
    Hi guys, I have made a video on YouTube here where I cover the Bagging and Boosting ensemble learning algorithms. I present how both work, and discuss their similarities and differences. I hope it may be of use to some of you out there. As always, feedback is more than welcomed! :) submitted by /u/Personal-Trainer-541 [link] [comments]  ( 47 min )
    Top A.I. Powered Tools Not Named ChatGPT
    submitted by /u/BackgroundResult [link] [comments]  ( 46 min )
    The misuse of AI is the familiar promise one thing and deliver something else
    submitted by /u/shanoshamanizum [link] [comments]  ( 45 min )
    Will there be a universal AI tool?
    When thinking about the possibilities of artificial intelligence, I couldn't stop thinking about some sort of interactive Wikipedia, imagine all the objective knowledge of the universe compiled into a firmware pre-installed app just like the 'Calculator'. Another type of device that could become an incredible tool for solving problems, might even be useful for business or professionals. We could name it the 'Communicator', just like a calculator you enter endless math problems, with the communicator you could enter endless word problems. I think we could grab much more productivity out of this technology than just asking philosophical questions to Chat GPT. submitted by /u/jvazorka03 [link] [comments]  ( 45 min )
    I wrote a blog post about use of AI in Music Industry. Take a Look!
    https://link.medium.com/HnMPSMA4zwb submitted by /u/Yigit_im [link] [comments]  ( 44 min )
    An interactive AI training simulation using Genetic Algorithm
    submitted by /u/SparshG [link] [comments]  ( 44 min )
    Bright Eye: free Mobil AI app that generates art, code, poems and essays, and more.
    Hey guys, I’m the cofounder of a tech startup focused on providing free AI services. We’ve developed a pretty cool app that offers AI services like image generation, code generation, image captioning, and more for free. We’re sort of like a Swiss Army knife of generative and analytical AI. We’ve released a new feature called AAIA(Ask AI Anything), which is capable of answering all types of questions, even requests to generate literature (fantasy, folklore, drama, fiction, fable, etc). It’s sort of like chat-gpt. We’d appreciate it if you could try it out and let us know your thoughts: https://apps.apple.com/us/app/bright-eye/id1593932475 submitted by /u/True-Marketing-5079 [link] [comments]  ( 46 min )
    I'm Compiling a List of Helpful AI Tools, Feel free to add any you've created/discovered
    submitted by /u/secret-millionaire [link] [comments]  ( 45 min )
    We created a list of AI projects and applications in Github
    There are incredible applications built using AI. It is definitely a trend that the world should not ignore. We started to maintain a collection of cool ai projects in Github: https://github.com/ai-collection/ai-collection Our mission is to increase reach and visibility for these awesome projects! It is updated daily and we hope that with the help of the community, it will be a great source for discovering AI applications. submitted by /u/beth0io [link] [comments]  ( 46 min )
  • Open

    Best Neural Networks Courses on Udemy to Consider
    submitted by /u/Lakshmireddys [link] [comments]  ( 52 min )
    New Abilities Emerge If Language Models Are Scaled Past Critical Point ⭕
    Last year, large language models (LLM) have broken record after record. ChatGPT got to 1 million users faster than Facebook, Spotify, and Instagram did. They helped create billion-dollar companies, and most notably they helped us recognize the divine nature of ducks. 2023 has started and ML progress is likely to continue at a break-neck speed. This is a great time to take a look at one of the most interesting papers from last year. Emergent Abilities in LLMs In a recent paper from Google Brain, Jason Wei and his colleagues allowed us a peak into the future. This beautiful research showed how scaling LLMs might allow them, among other things, to: Become better at math Understand even more subtleties of human language reduce hallucination and answer truthfully ... (See the plot o…  ( 77 min )
    Bagging vs Boosting Explained
    Hi guys, I have made a video on YouTube here where I cover the Bagging and Boosting ensemble learning algorithms. I present how both work, and discuss their similarities and differences. I hope it may be of use to some of you out there. As always, feedback is more than welcomed! :) submitted by /u/Personal-Trainer-541 [link] [comments]  ( 56 min )
    Help visualizing Spiral dataset
    Hello I want to visualize neural network learning to fit the spiral data set, I am able to plot the given data. What i want is to see is how my neural net is adjusting to fit data points , forming spiral shape, The problem is i don't know what to plot or how to do it, My softmax function outputs probability distribution and my categorical function outputs loss, so i'm lost. submitted by /u/Purple_Gen3 [link] [comments]  ( 55 min )
  • Open

    [D]: Are there models like CODEX but work in a reversed way?
    Many models these days focus on code generation. But I was wondering if there's anything for understanding existing codebase? I know that Codex or ChatGPT can understand what a function does, but what about a complex codebase with imports and nested calls? Are these models capable of understanding the relationship between functions? I'm trying to build a side project where you give it a production level codebase, it does some magic, then I can ask AI anything about things in this codebase with high accuracy. submitted by /u/GoodluckH [link] [comments]  ( 58 min )
    [D] Leveraging multiple photos for super resolution / restoration
    Is there any work that does this? Let's say you have 20 good, related photos, and one bad one you want to restore / upscale / denoise / sharpen / inpaint etc. Those 20 good images of the same person / object / building should give the model a good sense of how to fill in missing details in the bad photo. I imagine there's work somewhere in this direction but can't find anything. submitted by /u/anonDogeLover [link] [comments]  ( 55 min )
    [P] handlingclassifier.ml - predicts size category from provided product name - works with IKEA-like range of products
    submitted by /u/curryprogrammer [link] [comments]  ( 55 min )
    [P] Question regarding ID3 and cross validation
    I created an ID3 algorithm using scilab for a project at university. The project is more a proof of concept then an having an actual usecase. Its written in scilab without using any toolboxes and classifies if you have won in tic tac toe. My code basically uses a dataset that has every possible endgame board configuration of tic tac toe and builds a descision tree. I can then input a specific endgame board configuration and it tells me if i won or not. It works fine so far and since my dataset has all possible configurations, predicts the correct label 100% of times. I now added a ten times k-fold cross validation algorithm. However, the validation only gives me an accuracy of about 80%. Am I missing something here? Does a cross validation even make sense if my training set contains all possible data points? Hope someone can give some answers. submitted by /u/i-dunnodude [link] [comments]  ( 57 min )
    [R] Photorealistic human image editing using attention with GANs
    submitted by /u/psarpei [link] [comments]  ( 58 min )
    [D] Speaker diarization: reusing fitted speaker embedding clusters?
    I am trying to create speaker-aware transcripts from (multiple) audio files of a podcast. Right now I'm using OpenAI Whisper for the transcripts and pyannote.audio for speaker diarization (speaker segmentation + centroid clustering) In order to speed up the process (diarization time doesn't seem to scale linearly), I'd like to fit the centroids with the first audio file, and use those to predict the speakers (clusters of the speaker embeddings) of the other audio files, as the speakers don't change across episodes. However, the default pyannote.audio diarization pipeline refits the clusters for each audio file. Do you know of any other Python framework that allows reusing the fitted clusters, or any way pyannote.audio allows this? Is this even possible? Any other way to achieve the desired results? submitted by /u/2blazen [link] [comments]  ( 57 min )
    [Project] Stable Diffusion Pokémon cards
    submitted by /u/thundergolfer [link] [comments]  ( 62 min )
    [R] Differentiable Point-Based Radiance Fields for Efficient View Synthesis
    submitted by /u/t0ns0fph0t0ns [link] [comments]  ( 55 min )
    [R] from a human motion sequence, SUMMON synthesizes physically plausible and semantically reasonable objects
    submitted by /u/t0ns0fph0t0ns [link] [comments]  ( 62 min )
    [R] Towards Teachable Reasoning Systems: Using a Dynamic Memory of User Feedback for Continual System Improvement - TeachMe - Bhavana Dalvi Mishra et al Allen Institute for AI
    Paper: https://arxiv.org/abs/2204.13074 Blog: https://blog.allenai.org/towards-teachable-reasoning-systems-dd16659fd9f8 Youtube: https://www.youtube.com/watch?v=c5j_tWsENFg Abstract: Our goal is a teachable reasoning system for question-answering (QA), where a user can interact with faithful answer explanations, and correct its errors so that the system improves over time. Our approach is to augment a QA model with a dynamic memory of user feedback, containing user-supplied corrections to erroneous model beliefs that users identify during interaction. Retrievals from memory are used as additional context for QA, to help avoid previous mistakes in similar new situations - a novel application of memory-based continuous learning. With simulated feedback, we find that our system (called TeachMe) continually improves with time, and without model retraining, requiring feedback on only 25% of training examples to reach within 1% of the upper-bound (feedback on all examples). Similarly, in experiments with real users, we observe a similar trend, with performance improving by over 15% on a hidden test set after teaching. This suggests new opportunities for using frozen language models in an interactive setting where users can inspect, debug, and correct the model's beliefs, leading to improved system's performance over time. https://preview.redd.it/umosmtgzj0ca1.jpg?width=507&format=pjpg&auto=webp&s=65f66ed1230ae6bdce73a86386fbfcc860cd4c59 https://preview.redd.it/5lbhvwgzj0ca1.jpg?width=680&format=pjpg&auto=webp&s=99b402f6db395a62756113b2f8cb879667d444ef https://preview.redd.it/jd7oaygzj0ca1.jpg?width=1308&format=pjpg&auto=webp&s=4fac23ba2ecd68eb5b744f6c10a9c09ba604376c https://preview.redd.it/q137kkhzj0ca1.jpg?width=839&format=pjpg&auto=webp&s=24052a161029b23b32944954f88f432348016ea0 submitted by /u/Singularian2501 [link] [comments]  ( 61 min )
    [N] Class-action law­suit filed against Sta­bil­ity AI, DeviantArt, and Mid­journey for using the text-to-image AI Sta­ble Dif­fu­sion
    submitted by /u/Wiskkey [link] [comments]  ( 76 min )
    [D] What's hot for Machine Learning Research in 2023?
    Which of the sub-fields/approaches, application areas are expected to gain much attention (pun unintended) this year in the academia? PS - Inspired from a similar question last year (https://www.reddit.com/r/MachineLearning/comments/t04ekm/d_whats_hot_for_machine_learning_research_in_2022/) submitted by /u/Aromatic_Eye_6268 [link] [comments]  ( 57 min )
    [D] Is MusicGPT a viable possibility?
    As in, "Pink Floyd, Another Brick in the Wall, ska, heavy trumpet, female vocalist" It seems that if copyright issues are a controversial element of AI art, then copyrighted music will run into the same issue. Or is this not true? submitted by /u/markhachman [link] [comments]  ( 61 min )
  • Open

    Is there a better server alternative than AWS/Azure/Nvidia...for students?
    I'm a student and I've gotten to the part of my machine learning project where I need to optimize a lot. When I say a lot, I mean a lot, I have complex models. Most people these days usually pay for the services of "big tech companies" like Amazon, Microsoft, etc. to get their models trained. But I think in my case it would cost a lot of money, all tho I am aware that some have student discounts. Are there any alternatives like universities that allow students to do this or something else entirely? If not, which of these companies would you recommend best in terms of computing/price ? Thanks for all the replies submitted by /u/Apprehensive_Rush314 [link] [comments]  ( 54 min )
    _Rocket League_ RL agent 'Nexto' now in top 0.5% of players
    submitted by /u/gwern [link] [comments]  ( 52 min )
  • Open

    Foreshadowing Page Rank
    Douglas Hofstadter, best known as the author of Godel, Escher, Bach, wrote the foreword to Clark Kimberling’s book Triangle Centers and Central Triangles. Hofstadter begins by saying that in his study of math he “sadly managed to sidestep virtually all of geometry” and developed an interest in geometry, specifically triangle centers, much later. The ancient […] Foreshadowing Page Rank first appeared on John D. Cook.  ( 6 min )
  • Open

    Novelty Socks by AI
    I like a fun sock. The more random the design, the better. What kinds of novelty sock ideas would we get if we used AI as a creativity aid? It turns out they aren't too novel unless the AI is glitchy. I collected 14 examples of socks I  ( 6 min )
    Bonus: More novelty socks
    AI Weirdness: the strange side of machine learning  ( 2 min )
  • Open

    Benign Underfitting of Stochastic Gradient Descent. (arXiv:2202.13361v4 [cs.LG] UPDATED)
    We study to what extent may stochastic gradient descent (SGD) be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to training data. We consider the fundamental stochastic convex optimization framework, where (one pass, without-replacement) SGD is classically known to minimize the population risk at rate $O(1/\sqrt n)$, and prove that, surprisingly, there exist problem instances where the SGD solution exhibits both empirical risk and generalization gap of $\Omega(1)$. Consequently, it turns out that SGD is not algorithmically stable in any sense, and its generalization ability cannot be explained by uniform convergence or any other currently known generalization bound technique for that matter (other than that of its classical analysis). We then continue to analyze the closely related with-replacement SGD, for which we show that an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate. Finally, we interpret our main results in the context of without-replacement SGD for finite-sum convex optimization problems, and derive upper and lower bounds for the multi-epoch regime that significantly improve upon previously known results.  ( 2 min )
    Bayesian inference via sparse Hamiltonian flows. (arXiv:2203.05723v2 [stat.ML] UPDATED)
    A Bayesian coreset is a small, weighted subset of data that replaces the full dataset during Bayesian inference, with the goal of reducing computational cost. Although past work has shown empirically that there often exists a coreset with low inferential error, efficiently constructing such a coreset remains a challenge. Current methods tend to be slow, require a secondary inference step after coreset construction, and do not provide bounds on the data marginal evidence. In this work, we introduce a new method -- sparse Hamiltonian flows -- that addresses all three of these challenges. The method involves first subsampling the data uniformly, and then optimizing a Hamiltonian flow parametrized by coreset weights and including periodic momentum quasi-refreshment steps. Theoretical results show that the method enables an exponential compression of the dataset in a representative model, and that the quasi-refreshment steps reduce the KL divergence to the target. Real and synthetic experiments demonstrate that sparse Hamiltonian flows provide accurate posterior approximations with significantly reduced runtime compared with competing dynamical-system-based inference methods.  ( 2 min )
    Manifold Fitting under Unbounded Noise. (arXiv:1909.10228v2 [stat.ML] UPDATED)
    There has been an emerging trend in non-Euclidean statistical analysis of aiming to recover a low dimensional structure, namely a manifold, underlying the high dimensional data. Recovering the manifold requires the noise to be of certain concentration. Existing methods address this problem by constructing an approximated manifold based on the tangent space estimation at each sample point. Although theoretical convergence for these methods is guaranteed, either the samples are noiseless or the noise is bounded. However, if the noise is unbounded, which is a common scenario, the tangent space estimation at the noisy samples will be blurred. Fitting a manifold from the blurred tangent space might increase the inaccuracy. In this paper, we introduce a new manifold-fitting method, by which the output manifold is constructed by directly estimating the tangent spaces at the projected points on the underlying manifold, rather than at the sample points, to decrease the error caused by the noise. Assuming the noise is unbounded, our new method provides theoretical convergence in high probability, in terms of the upper bound of the distance between the estimated and underlying manifold. The smoothness of the estimated manifold is also evaluated by bounding the supremum of twice difference above. Numerical simulations are provided to validate our theoretical findings and demonstrate the advantages of our method over other relevant manifold fitting methods. Finally, our method is applied to real data examples.  ( 2 min )
    Efficient Ridge Solution for the Incremental Broad Learning System on Added Nodes by Inverse Cholesky Factorization of a Partitioned Matrix. (arXiv:1911.04872v4 [cs.LG] UPDATED)
    To accelerate the existing Broad Learning System (BLS) for new added nodes in [7], we extend the inverse Cholesky factorization in [10] to deduce an efficient inverse Cholesky factorization for a Hermitian matrix partitioned into 2 * 2 blocks, which is utilized to develop the proposed BLS algorithm 1. The proposed BLS algorithm 1 compute the ridge solution (i.e, the output weights) from the inverse Cholesky factor of the Hermitian matrix in the ridge inverse, and update the inverse Cholesky factor efficiently. From the proposed BLS algorithm 1, we deduce the proposed ridge inverse, which can be obtained from the generalized inverse in [7] by just change one matrix in the equation to compute the newly added sub-matrix. We also modify the proposed algorithm 1 into the proposed algorithm 2, which is equivalent to the existing BLS algorithm [7] in terms of numerical computations. The proposed algorithms 1 and 2 can reduce the computational complexity, since usually the Hermitian matrix in the ridge inverse is smaller than the ridge inverse. With respect to the existing BLS algorithm, the proposed algorithms 1 and 2 usually require about 13 and 2 3 of complexities, respectively, while in numerical experiments they achieve the speedups (in each additional training time) of 2.40 - 2.91 and 1.36 - 1.60, respectively. Numerical experiments also show that the proposed algorithm 1 and the standard ridge solution always bear the same testing accuracy, and usually so do the proposed algorithm 2 and the existing BLS algorithm. The existing BLS assumes the ridge parameter lamda->0, since it is based on the generalized inverse with the ridge regression approximation. When the assumption of lamda-> 0 is not satisfied, the standard ridge solution obviously achieves a better testing accuracy than the existing BLS algorithm in numerical experiments.  ( 3 min )
    Tracr: Compiled Transformers as a Laboratory for Interpretability. (arXiv:2301.05062v1 [cs.LG])
    Interpretability research aims to build tools for understanding machine learning (ML) models. However, such tools are inherently hard to evaluate because we do not have ground truth information about how ML models actually work. In this work, we propose to build transformer models manually as a testbed for interpretability research. We introduce Tracr, a "compiler" for translating human-readable programs into weights of a transformer model. Tracr takes code written in RASP, a domain-specific language (Weiss et al. 2021), and translates it into weights for a standard, decoder-only, GPT-like transformer architecture. We use Tracr to create a range of ground truth transformers that implement programs including computing token frequencies, sorting, and Dyck-n parenthesis checking, among others. To enable the broader research community to explore and use compiled models, we provide an open-source implementation of Tracr at https://github.com/deepmind/tracr.  ( 2 min )
    Fed-TDA: Federated Tabular Data Augmentation on Non-IID Data. (arXiv:2211.13116v2 [cs.LG] UPDATED)
    Non-independent and identically distributed (non-IID) data is a key challenge in federated learning (FL), which usually hampers the optimization convergence and the performance of FL. Existing data augmentation methods based on federated generative models or raw data sharing strategies for solving the non-IID problem still suffer from low performance, privacy protection concerns, and high communication overhead in decentralized tabular data. To tackle these challenges, we propose a federated tabular data augmentation method, named Fed-TDA. The core idea of Fed-TDA is to synthesize tabular data for data augmentation using some simple statistics (e.g., distributions of each column and global covariance). Specifically, we propose the multimodal distribution transformation and inverse cumulative distribution mapping respectively synthesize continuous and discrete columns in tabular data from a noise according to the pre-learned statistics. Furthermore, we theoretically analyze that our Fed-TDA not only preserves data privacy but also maintains the distribution of the original data and the correlation between columns. Through extensive experiments on five real-world tabular datasets, we demonstrate the superiority of Fed-TDA over the state-of-the-art in test performance and communication efficiency.  ( 2 min )
    Study of software developers' experience using the Github Copilot Tool in the software development process. (arXiv:2301.04991v1 [cs.SE])
    In software development there is a constant pressure to produce code faster and faster without compromising on quality. New tools supporting developers are created in response to this demand. Currently a new generation of such solutions is about to be launched - Artificial Intelligence driven tools. On 29 June 2021 Github Copilot was announced. It uses trained model to generate code based on human understandable language. The focus of this research was to investigate software developers' approach to this tool. For this purpose a survey containing 18 questions was prepared and shared with programmers. A total of 42 answers were gathered. The results of the research indicate that developers' opinions are divided. Most of them met Github Copilot before attending the survey. The attitude to the tool was mostly positive but not many participants were willing to use it. Concerns are caused by security issues associated with using of Github Copilot.
    A Stochastic Proximal Polyak Step Size. (arXiv:2301.04935v1 [math.OC])
    Recently, the stochastic Polyak step size (SPS) has emerged as a competitive adaptive step size scheme for stochastic gradient descent. Here we develop ProxSPS, a proximal variant of SPS that can handle regularization terms. Developing a proximal variant of SPS is particularly important, since SPS requires a lower bound of the objective function to work well. When the objective function is the sum of a loss and a regularizer, available estimates of a lower bound of the sum can be loose. In contrast, ProxSPS only requires a lower bound for the loss which is often readily available. As a consequence, we show that ProxSPS is easier to tune and more stable in the presence of regularization. Furthermore for image classification tasks, ProxSPS performs as well as AdamW with little to no tuning, and results in a network with smaller weight parameters. We also provide an extensive convergence analysis for ProxSPS that includes the non-smooth, smooth, weakly convex and strongly convex setting.
    Statistical Learning with Sublinear Regret of Propagator Models. (arXiv:2301.05157v1 [q-fin.TR])
    We consider a class of learning problems in which an agent liquidates a risky asset while creating both transient price impact driven by an unknown convolution propagator and linear temporary price impact with an unknown parameter. We characterize the trader's performance as maximization of a revenue-risk functional, where the trader also exploits available information on a price predicting signal. We present a trading algorithm that alternates between exploration and exploitation phases and achieves sublinear regrets with high probability. For the exploration phase we propose a novel approach for non-parametric estimation of the price impact kernel by observing only the visible price process and derive sharp bounds on the convergence rate, which are characterised by the singularity of the propagator. These kernel estimation methods extend existing methods from the area of Tikhonov regularisation for inverse problems and are of independent interest. The bound on the regret in the exploitation phase is obtained by deriving stability results for the optimizer and value function of the associated class of infinite-dimensional stochastic control problems. As a complementary result we propose a regression-based algorithm to estimate the conditional expectation of non-Markovian signals and derive its convergence rate.
    Thompson Sampling with Diffusion Generative Prior. (arXiv:2301.05182v1 [cs.LG])
    In this work, we initiate the idea of using denoising diffusion models to learn priors for online decision making problems. Our special focus is on the meta-learning for bandit framework, with the goal of learning a strategy that performs well across bandit tasks of a same class. To this end, we train a diffusion model that learns the underlying task distribution and combine Thompson sampling with the learned prior to deal with new tasks at test time. Our posterior sampling algorithm is designed to carefully balance between the learned prior and the noisy observations that come from the learner's interaction with the environment. To capture realistic bandit scenarios, we also propose a novel diffusion model training procedure that trains even from incomplete and/or noisy data, which could be of independent interest. Finally, our extensive experimental evaluations clearly demonstrate the potential of the proposed approach.
    Variational Inference: Posterior Threshold Improves Network Clustering Accuracy in Sparse Regimes. (arXiv:2301.04771v1 [stat.ML])
    Variational inference has been widely used in machine learning literature to fit various Bayesian models. In network analysis, this method has been successfully applied to solve the community detection problems. Although these results are promising, their theoretical support is only for relatively dense networks, an assumption that may not hold for real networks. In addition, it has been shown recently that the variational loss surface has many saddle points, which may severely affect its performance, especially when applied to sparse networks. This paper proposes a simple way to improve the variational inference method by hard thresholding the posterior of the community assignment after each iteration. Using a random initialization that correlates with the true community assignment, we show that the proposed method converges and can accurately recover the true community labels, even when the average node degree of the network is bounded. Extensive numerical study further confirms the advantage of the proposed method over the classical variational inference and another state-of-the-art algorithm.  ( 2 min )
    The Berkelmans-Pries Feature Importance Method: A Generic Measure of Informativeness of Features. (arXiv:2301.04740v1 [cs.LG])
    Over the past few years, the use of machine learning models has emerged as a generic and powerful means for prediction purposes. At the same time, there is a growing demand for interpretability of prediction models. To determine which features of a dataset are important to predict a target variable $Y$, a Feature Importance (FI) method can be used. By quantifying how important each feature is for predicting $Y$, irrelevant features can be identified and removed, which could increase the speed and accuracy of a model, and moreover, important features can be discovered, which could lead to valuable insights. A major problem with evaluating FI methods, is that the ground truth FI is often unknown. As a consequence, existing FI methods do not give the exact correct FI values. This is one of the many reasons why it can be hard to properly interpret the results of an FI method. Motivated by this, we introduce a new global approach named the Berkelmans-Pries FI method, which is based on a combination of Shapley values and the Berkelmans-Pries dependency function. We prove that our method has many useful properties, and accurately predicts the correct FI values for several cases where the ground truth FI can be derived in an exact manner. We experimentally show for a large collection of FI methods (468) that existing methods do not have the same useful properties. This shows that the Berkelmans-Pries FI method is a highly valuable tool for analyzing datasets with complex interdependencies.  ( 2 min )
    Private estimation algorithms for stochastic block models and mixture models. (arXiv:2301.04822v1 [cs.DS])
    We introduce general tools for designing efficient private estimation algorithms, in the high-dimensional settings, whose statistical guarantees almost match those of the best known non-private algorithms. To illustrate our techniques, we consider two problems: recovery of stochastic block models and learning mixtures of spherical Gaussians. For the former, we present the first efficient $(\epsilon, \delta)$-differentially private algorithm for both weak recovery and exact recovery. Previously known algorithms achieving comparable guarantees required quasi-polynomial time. For the latter, we design an $(\epsilon, \delta)$-differentially private algorithm that recovers the centers of the $k$-mixture when the minimum separation is at least $ O(k^{1/t}\sqrt{t})$. For all choices of $t$, this algorithm requires sample complexity $n\geq k^{O(1)}d^{O(t)}$ and time complexity $(nd)^{O(t)}$. Prior work required minimum separation at least $O(\sqrt{k})$ as well as an explicit upper bound on the Euclidean norm of the centers.  ( 2 min )
    Universality of neural dynamics on complex networks. (arXiv:2301.04900v1 [cond-mat.stat-mech])
    This paper discusses the capacity of graph neural networks to learn the functional form of ordinary differential equations that govern dynamics on complex networks. We propose necessary elements for such a problem, namely, inductive biases, a neural network architecture and a learning task. Statistical learning theory suggests that generalisation power of neural networks relies on independence and identical distribution (i.i.d.)\ of training and testing data. Although this assumption together with an appropriate neural architecture and a learning mechanism is sufficient for accurate out-of-sample predictions of dynamics such as, e.g.\ mass-action kinetics, by studying the out-of-distribution generalisation in the case of diffusion dynamics, we find that the neural network model: (i) has a generalisation capacity that depends on the first moment of the initial value data distribution; (ii) learns the non-dissipative nature of dynamics implicitly; and (iii) the model's accuracy resolution limit is of order $\mathcal{O}(1/\sqrt{n})$ for a system of size $n$.  ( 2 min )
    Self-Attention Amortized Distributional Projection Optimization for Sliced Wasserstein Point-Cloud Reconstruction. (arXiv:2301.04791v1 [stat.ML])
    Max sliced Wasserstein (Max-SW) distance has been widely known as a solution for redundant projections of sliced Wasserstein (SW) distance. In applications that have various independent pairs of probability measures, amortized projection optimization is utilized to predict the ``max" projecting directions given two input measures instead of using projected gradient ascent multiple times. Despite being efficient, the first issue of the current framework is the violation of permutation invariance property and symmetry property. To address the issue, we propose to design amortized models based on self-attention architecture. Moreover, we adopt efficient self-attention architectures to make the computation linear in the number of supports. Secondly, Max-SW and its amortized version cannot guarantee metricity property due to the sub-optimality of the projected gradient ascent and the amortization gap. Therefore, we propose to replace Max-SW with distributional sliced Wasserstein distance with von Mises-Fisher (vMF) projecting distribution (v-DSW). Since v-DSW is a metric with any non-degenerate vMF distribution, its amortized version can guarantee the metricity when predicting the best discriminate projecting distribution. With the two improvements, we derive self-attention amortized distributional projection optimization and show its appealing performance in point-cloud reconstruction and its downstream applications.  ( 2 min )
    Multimodal Deep Learning. (arXiv:2301.04856v1 [cs.CL])
    This book is the result of a seminar in which we reviewed multimodal approaches and attempted to create a solid overview of the field, starting with the current state-of-the-art approaches in the two subfields of Deep Learning individually. Further, modeling frameworks are discussed where one modality is transformed into the other, as well as models in which one modality is utilized to enhance representation learning for the other. To conclude the second part, architectures with a focus on handling both modalities simultaneously are introduced. Finally, we also cover other modalities as well as general-purpose multi-modal models, which are able to handle different tasks on different modalities within one unified architecture. One interesting application (Generative Art) eventually caps off this booklet.  ( 2 min )
  • Open

    Robust Phi-Divergence MDPs. (arXiv:2205.14202v2 [math.OC] UPDATED)
    In recent years, robust Markov decision processes (MDPs) have emerged as a prominent modeling framework for dynamic decision problems affected by uncertainty. In contrast to classical MDPs, which only account for stochasticity by modeling the dynamics through a stochastic process with a known transition kernel, robust MDPs additionally account for ambiguity by optimizing in view of the most adverse transition kernel from a prescribed ambiguity set. In this paper, we develop a novel solution framework for robust MDPs with s-rectangular ambiguity sets that decomposes the problem into a sequence of robust Bellman updates and simplex projections. Exploiting the rich structure present in the simplex projections corresponding to phi-divergence ambiguity sets, we show that the associated s-rectangular robust MDPs can be solved substantially faster than with state-of-the-art commercial solvers as well as a recent first-order solution scheme, thus rendering them attractive alternatives to classical MDPs in practical applications.  ( 2 min )
    See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning. (arXiv:2301.05226v1 [cs.CV])
    Large pre-trained vision and language models have demonstrated remarkable capacities for various tasks. However, solving the knowledge-based visual reasoning tasks remains challenging, which requires a model to comprehensively understand image content, connect the external world knowledge, and perform step-by-step reasoning to answer the questions correctly. To this end, we propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning. IPVR contains three stages, see, think and confirm. The see stage scans the image and grounds the visual concept candidates with a visual perception model. The think stage adopts a pre-trained large language model (LLM) to attend to the key concepts from candidates adaptively. It then transforms them into text context for prompting with a visual captioning model and adopts the LLM to generate the answer. The confirm stage further uses the LLM to generate the supporting rationale to the answer, verify the generated rationale with a cross-modality classifier and ensure that the rationale can infer the predicted output consistently. We conduct experiments on a range of knowledge-based visual reasoning datasets. We found our IPVR enjoys several benefits, 1). it achieves better performance than the previous few-shot learning baselines; 2). it enjoys the total transparency and trustworthiness of the whole reasoning process by providing rationales for each reasoning step; 3). it is computation-efficient compared with other fine-tuning baselines.  ( 2 min )
    Masked Autoencoders that Listen. (arXiv:2207.06405v3 [cs.SD] UPDATED)
    This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers. The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram. We find it beneficial to incorporate local window attention in the decoder, as audio spectrograms are highly correlated in local time and frequency bands. We then fine-tune the encoder with a lower masking ratio on target datasets. Empirically, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training. The code and models will be at https://github.com/facebookresearch/AudioMAE.  ( 2 min )
    Model reduction for the material point method via an implicit neural representation of the deformation map. (arXiv:2109.12390v3 [cs.LG] UPDATED)
    This work proposes a model-reduction approach for the material point method on nonlinear manifolds. Our technique approximates the $\textit{kinematics}$ by approximating the deformation map using an implicit neural representation that restricts deformation trajectories to reside on a low-dimensional manifold. By explicitly approximating the deformation map, its spatiotemporal gradients -- in particular the deformation gradient and the velocity -- can be computed via analytical differentiation. In contrast to typical model-reduction techniques that construct a linear or nonlinear manifold to approximate the (finite number of) degrees of freedom characterizing a given spatial discretization, the use of an implicit neural representation enables the proposed method to approximate the $\textit{continuous}$ deformation map. This allows the kinematic approximation to remain agnostic to the discretization. Consequently, the technique supports dynamic discretizations -- including resolution changes -- during the course of the online reduced-order-model simulation. To generate $\textit{dynamics}$ for the generalized coordinates, we propose a family of projection techniques. At each time step, these techniques: (1) Calculate full-space kinematics at quadrature points, (2) Calculate the full-space dynamics for a subset of `sample' material points, and (3) Calculate the reduced-space dynamics by projecting the updated full-space position and velocity onto the low-dimensional manifold and tangent space, respectively. We achieve significant computational speedup via hyper-reduction that ensures all three steps execute on only a small subset of the problem's spatial domain. Large-scale numerical examples with millions of material points illustrate the method's ability to gain an order of magnitude computational-cost saving -- indeed $\textit{real-time simulations}$ -- with negligible errors.  ( 3 min )
    Adversarial Adaptation for French Named Entity Recognition. (arXiv:2301.05220v1 [cs.CL])
    Named Entity Recognition (NER) is the task of identifying and classifying named entities in large-scale texts into predefined classes. NER in French and other relatively limited-resource languages cannot always benefit from approaches proposed for languages like English due to a dearth of large, robust datasets. In this paper, we present our work that aims to mitigate the effects of this dearth of large, labeled datasets. We propose a Transformer-based NER approach for French, using adversarial adaptation to similar domain or general corpora to improve feature extraction and enable better generalization. Our approach allows learning better features using large-scale unlabeled corpora from the same domain or mixed domains to introduce more variations during training and reduce overfitting. Experimental results on three labeled datasets show that our adaptation framework outperforms the corresponding non-adaptive models for various combinations of Transformer models, source datasets, and target corpora. We also show that adversarial adaptation to large-scale unlabeled corpora can help mitigate the performance dip incurred on using Transformer models pre-trained on smaller corpora.  ( 2 min )
    MANAS: Multi-Agent Neural Architecture Search. (arXiv:1909.01051v4 [cs.CV] UPDATED)
    The Neural Architecture Search (NAS) problem is typically formulated as a graph search problem where the goal is to learn the optimal operations over edges in order to maximise a graph-level global objective. Due to the large architecture parameter space, efficiency is a key bottleneck preventing NAS from its practical use. In this paper, we address the issue by framing NAS as a multi-agent problem where agents control a subset of the network and coordinate to reach optimal architectures. We provide two distinct lightweight implementations, with reduced memory requirements (1/8th of state-of-the-art), and performances above those of much more computationally expensive methods. Theoretically, we demonstrate vanishing regrets of the form O(sqrt(T)), with T being the total number of rounds. Finally, aware that random search is an, often ignored, effective baseline we perform additional experiments on 3 alternative datasets and 2 network configurations, and achieve favourable results in comparison.  ( 2 min )
    Anomalies, Representations, and Self-Supervision. (arXiv:2301.04660v1 [hep-ph])
    We develop a self-supervised method for density-based anomaly detection using contrastive learning, and test it using event-level anomaly data from CMS ADC2021. The AnomalyCLR technique is data-driven and uses augmentations of the background data to mimic non-Standard-Model events in a model-agnostic way. It uses a permutation-invariant Transformer Encoder architecture to map the objects measured in a collider event to the representation space, where the data augmentations define a representation space which is sensitive to potential anomalous features. An AutoEncoder trained on background representations then computes anomaly scores for a variety of signals in the representation space. With AnomalyCLR we find significant improvements on performance metrics for all signals when compared to the raw data baseline.  ( 2 min )
    Time Series Clustering with an EM algorithm for Mixtures of Linear Gaussian State Space Models. (arXiv:2208.11907v2 [cs.LG] UPDATED)
    In this paper, we consider the task of clustering a set of individual time series while modeling each cluster, that is, model-based time series clustering. The task requires a parametric model with sufficient flexibility to describe the dynamics in various time series. To address this problem, we propose a novel model-based time series clustering method with mixtures of linear Gaussian state space models, which have high flexibility. The proposed method uses a new expectation-maximization algorithm for the mixture model to estimate the model parameters, and determines the number of clusters using the Bayesian information criterion. Experiments on a simulated dataset demonstrate the effectiveness of the method in clustering, parameter estimation, and model selection. The method is applied to a real dataset for which previously proposed time series clustering methods exhibited low accuracy. Results showed that our method produces more accurate clustering results than those obtained using the previous methods.  ( 2 min )
    RaftMLP: How Much Can Be Done Without Attention and with Less Spatial Locality?. (arXiv:2108.04384v3 [cs.CV] UPDATED)
    For the past ten years, CNN has reigned supreme in the world of computer vision, but recently, Transformer has been on the rise. However, the quadratic computational cost of self-attention has become a serious problem in practice applications. There has been much research on architectures without CNN and self-attention in this context. In particular, MLP-Mixer is a simple architecture designed using MLPs and hit an accuracy comparable to the Vision Transformer. However, the only inductive bias in this architecture is the embedding of tokens. This leaves open the possibility of incorporating a non-convolutional (or non-local) inductive bias into the architecture, so we used two simple ideas to incorporate inductive bias into the MLP-Mixer while taking advantage of its ability to capture global correlations. A way is to divide the token-mixing block vertically and horizontally. Another way is to make spatial correlations denser among some channels of token-mixing. With this approach, we were able to improve the accuracy of the MLP-Mixer while reducing its parameters and computational complexity. The small model that is RaftMLP-S is comparable to the state-of-the-art global MLP-based model in terms of parameters and efficiency per calculation. In addition, we tackled the problem of fixed input image resolution for global MLP-based models by utilizing bicubic interpolation. We demonstrated that these models could be applied as the backbone of architectures for downstream tasks such as object detection. However, it did not have significant performance and mentioned the need for MLP-specific architectures for downstream tasks for global MLP-based models. The source code in PyTorch version is available at \url{https://github.com/okojoalg/raft-mlp}.  ( 3 min )
    Modeling the evolution of temporal knowledge graphs with uncertainty. (arXiv:2301.04977v1 [cs.LG])
    Forecasting future events is a fundamental challenge for temporal knowledge graphs (tKG). As in real life predicting a mean function is most of the time not sufficient, but the question remains how confident can we be about our prediction? Thus, in this work, we will introduce a novel graph neural network architecture (WGP-NN) employing (weighted) Gaussian processes (GP) to jointly model the temporal evolution of the occurrence probability of events and their time-dependent uncertainty. Especially we employ Gaussian processes to model the uncertainty of future links by their ability to predict predictive variance. This is in contrast to existing works, which are only able to express uncertainties in the learned entity representations. Moreover, WGP-NN can model parameter-free complex temporal and structural dynamics of tKGs in continuous time. We further demonstrate the model's state-of-the-art performance on two real-world benchmark datasets.  ( 2 min )
    Neural Systematic Binder. (arXiv:2211.01177v2 [cs.CV] UPDATED)
    The key to high-level cognition is believed to be the ability to systematically manipulate and compose knowledge pieces. While token-like structured knowledge representations are naturally provided in text, it is elusive how to obtain them for unstructured modalities such as scene images. In this paper, we propose a neural mechanism called Neural Systematic Binder or SysBinder for constructing a novel structured representation called Block-Slot Representation. In Block-Slot Representation, object-centric representations known as slots are constructed by composing a set of independent factor representations called blocks, to facilitate systematic generalization. SysBinder obtains this structure in an unsupervised way by alternatingly applying two different binding principles: spatial binding for spatial modularity across the full scene and factor binding for factor modularity within an object. SysBinder is a simple, deterministic, and general-purpose layer that can be applied as a drop-in module in any arbitrary neural network and on any modality. In experiments, we find that SysBinder provides significantly better factor disentanglement within the slots than the conventional object-centric methods, including, for the first time, in visually complex scene images such as CLEVR-Tex. Furthermore, we demonstrate factor-level systematicity in controlled scene generation by decoding unseen factor combinations.  ( 2 min )
    Meta-DMoE: Adapting to Domain Shift by Meta-Distillation from Mixture-of-Experts. (arXiv:2210.03885v2 [cs.LG] UPDATED)
    In this paper, we tackle the problem of domain shift. Most existing methods perform training on multiple source domains using a single model, and the same trained model is used on all unseen target domains. Such solutions are sub-optimal as each target domain exhibits its own specialty, which is not adapted. Furthermore, expecting single-model training to learn extensive knowledge from multiple source domains is counterintuitive. The model is more biased toward learning only domain-invariant features and may result in negative knowledge transfer. In this work, we propose a novel framework for unsupervised test-time adaptation, which is formulated as a knowledge distillation process to address domain shift. Specifically, we incorporate Mixture-of-Experts (MoE) as teachers, where each expert is separately trained on different source domains to maximize their specialty. Given a test-time target domain, a small set of unlabeled data is sampled to query the knowledge from MoE. As the source domains are correlated to the target domains, a transformer-based aggregator then combines the domain knowledge by examining the interconnection among them. The output is treated as a supervision signal to adapt a student prediction network toward the target domain. We further employ meta-learning to enforce the aggregator to distill positive knowledge and the student network to achieve fast adaptation. Extensive experiments demonstrate that the proposed method outperforms the state-of-the-art and validates the effectiveness of each proposed component. Our code is available at https://github.com/n3il666/Meta-DMoE.  ( 2 min )
    Sequencer: Deep LSTM for Image Classification. (arXiv:2205.01972v4 [cs.CV] UPDATED)
    In recent computer vision research, the advent of the Vision Transformer (ViT) has rapidly revolutionized various architectural design efforts: ViT achieved state-of-the-art image classification performance using self-attention found in natural language processing, and MLP-Mixer achieved competitive performance using simple multi-layer perceptrons. In contrast, several studies have also suggested that carefully redesigned convolutional neural networks (CNNs) can achieve advanced performance comparable to ViT without resorting to these new ideas. Against this background, there is growing interest in what inductive bias is suitable for computer vision. Here we propose Sequencer, a novel and competitive architecture alternative to ViT that provides a new perspective on these issues. Unlike ViTs, Sequencer models long-range dependencies using LSTMs rather than self-attention layers. We also propose a two-dimensional version of Sequencer module, where an LSTM is decomposed into vertical and horizontal LSTMs to enhance performance. Despite its simplicity, several experiments demonstrate that Sequencer performs impressively well: Sequencer2D-L, with 54M parameters, realizes 84.6% top-1 accuracy on only ImageNet-1K. Not only that, we show that it has good transferability and the robust resolution adaptability on double resolution-band.  ( 2 min )
    A Network Science perspective of Graph Convolutional Networks: A survey. (arXiv:2301.04824v1 [cs.SI])
    The mining and exploitation of graph structural information have been the focal points in the study of complex networks. Traditional structural measures in Network Science focus on the analysis and modelling of complex networks from the perspective of network structure, such as the centrality measures, the clustering coefficient, and motifs and graphlets, and they have become basic tools for studying and understanding graphs. In comparison, graph neural networks, especially graph convolutional networks (GCNs), are particularly effective at integrating node features into graph structures via neighbourhood aggregation and message passing, and have been shown to significantly improve the performances in a variety of learning tasks. These two classes of methods are, however, typically treated separately with limited references to each other. In this work, aiming to establish relationships between them, we provide a network science perspective of GCNs. Our novel taxonomy classifies GCNs from three structural information angles, i.e., the layer-wise message aggregation scope, the message content, and the overall learning scope. Moreover, as a prerequisite for reviewing GCNs via a network science perspective, we also summarise traditional structural measures and propose a new taxonomy for them. Finally and most importantly, we draw connections between traditional structural approaches and graph convolutional networks, and discuss potential directions for future research.  ( 2 min )
    Fed-TDA: Federated Tabular Data Augmentation on Non-IID Data. (arXiv:2211.13116v2 [cs.LG] UPDATED)
    Non-independent and identically distributed (non-IID) data is a key challenge in federated learning (FL), which usually hampers the optimization convergence and the performance of FL. Existing data augmentation methods based on federated generative models or raw data sharing strategies for solving the non-IID problem still suffer from low performance, privacy protection concerns, and high communication overhead in decentralized tabular data. To tackle these challenges, we propose a federated tabular data augmentation method, named Fed-TDA. The core idea of Fed-TDA is to synthesize tabular data for data augmentation using some simple statistics (e.g., distributions of each column and global covariance). Specifically, we propose the multimodal distribution transformation and inverse cumulative distribution mapping respectively synthesize continuous and discrete columns in tabular data from a noise according to the pre-learned statistics. Furthermore, we theoretically analyze that our Fed-TDA not only preserves data privacy but also maintains the distribution of the original data and the correlation between columns. Through extensive experiments on five real-world tabular datasets, we demonstrate the superiority of Fed-TDA over the state-of-the-art in test performance and communication efficiency.  ( 2 min )
    Data-centric AI: Perspectives and Challenges. (arXiv:2301.04819v1 [cs.AI])
    The role of data in building AI systems has recently been significantly magnified by the emerging concept of data-centric AI (DCAI), which advocates a fundamental shift from model advancements to ensuring data quality and reliability. Although our community has continuously invested efforts into enhancing data in different aspects, they are often isolated initiatives on specific tasks. To facilitate the collective initiative in our community and push forward DCAI, we draw a big picture and bring together three general missions: training data development, evaluation data development, and data maintenance. We provide a top-level discussion on representative DCAI tasks and share perspectives. Finally, we list open challenges to motivate future exploration.  ( 2 min )
    Optirank: classification for RNA-Seq data with optimal ranking reference genes. (arXiv:2301.04653v1 [q-bio.GN])
    Classification algorithms using RNA-Sequencing (RNA-Seq) data as input are used in a variety of biological applications. By nature, RNA-Seq data is subject to uncontrolled fluctuations both within and especially across datasets, which presents a major difficulty for a trained classifier to generalize to an external dataset. Replacing raw gene counts with the rank of gene counts inside an observation has proven effective to mitigate this problem. However, the rank of a feature is by definition relative to all other features, including highly variable features that introduce noise in the ranking. To address this problem and obtain more robust ranks, we propose a logistic regression model, optirank, which learns simultaneously the parameters of the model and the genes to use as a reference set in the ranking. We show the effectiveness of this method on simulated data. We also consider real classification tasks, which present different kinds of distribution shifts between train and test data. Those tasks concern a variety of applications, such as cancer of unknown primary classification, identification of specific gene signatures, and determination of cell type in single-cell RNA-Seq datasets. On those real tasks, optirank performs at least as well as the vanilla logistic regression on classical ranks, while producing sparser solutions. In addition, to increase the robustness against dataset shifts, we propose a multi-source learning scheme and demonstrate its effectiveness when used in combination with rank-based classifiers.  ( 2 min )
    Effective Decision Boundary Learning for Class Incremental Learning. (arXiv:2301.05180v1 [cs.LG])
    Rehearsal approaches in class incremental learning (CIL) suffer from decision boundary overfitting to new classes, which is mainly caused by two factors: insufficiency of old classes data for knowledge distillation and imbalanced data learning between the learned and new classes because of the limited storage memory. In this work, we present a simple but effective approach to tackle these two factors. First, we employ a re-sampling strategy and Mixup K}nowledge D}istillation (Re-MKD) to improve the performances of KD, which would greatly alleviate the overfitting problem. Specifically, we combine mixup and re-sampling strategies to synthesize adequate data used in KD training that are more consistent with the latent distribution between the learned and new classes. Second, we propose a novel incremental influence balance (IIB) method for CIL to tackle the classification of imbalanced data by extending the influence balance method into the CIL setting, which re-weights samples by their influences to create a proper decision boundary. With these two improvements, we present the effective decision boundary learning algorithm (EDBL) which improves the performance of KD and deals with the imbalanced data learning simultaneously. Experiments show that the proposed EDBL achieves state-of-the-art performances on several CIL benchmarks.
    Second-Order Mirror Descent: Convergence in Games Beyond Averaging and Discounting. (arXiv:2111.09982v3 [math.OC] UPDATED)
    In this paper, we propose a second-order extension of the continuous-time game-theoretic mirror descent (MD) dynamics, referred to as MD2, which provably converges to mere (but not necessarily strict) variationally stable states (VSS) without using common auxiliary techniques such as time-averaging or discounting. We show that MD2 enjoys no-regret as well as an exponential rate of convergence towards strong VSS upon a slight modification. MD2 can also be used to derive many novel continuous-time primal-space dynamics. We then use stochastic approximation techniques to provide a convergence guarantee of discrete-time MD2 with noisy observations towards interior mere VSS. Selected simulations are provided to illustrate our results.
    Improving Axial-Attention Network Classification via Cross-Channel Weight Sharing. (arXiv:2110.01185v2 [cs.CV] UPDATED)
    In recent years, hypercomplex-inspired neural networks (HCNNs) have been used to improve deep learning architectures due to their ability to enable channel-based weight sharing, treat colors as a single entity, and improve representational coherence within the layers. The work described herein studies the effect of replacing existing layers in an Axial Attention network with their representationally coherent variants to assess the effect on image classification. We experiment with the stem of the network, the bottleneck layers, and the fully connected backend, by replacing them with representationally coherent variants. These various modifications lead to novel architectures which all yield improved accuracy performance on the ImageNet300k classification dataset. Our baseline networks for comparison were the original real-valued ResNet, the original quaternion-valued ResNet, and the Axial Attention ResNet. Since improvement was observed regardless of which part of the network was modified, there is a promise that this technique may be generally useful in improving classification accuracy for a large class of networks.
    Growing Cosine Unit: A Novel Oscillatory Activation Function That Can Speedup Training and Reduce Parameters in Convolutional Neural Networks. (arXiv:2108.12943v3 [cs.LG] CROSS LISTED)
    Convolutional neural networks have been successful in solving many socially important and economically significant problems. This ability to learn complex high-dimensional functions hierarchically can be attributed to the use of nonlinear activation functions. A key discovery that made training deep networks feasible was the adoption of the Rectified Linear Unit (ReLU) activation function to alleviate the vanishing gradient problem caused by using saturating activation functions. Since then, many improved variants of the ReLU activation have been proposed. However, a majority of activation functions used today are non-oscillatory and monotonically increasing due to their biological plausibility. This paper demonstrates that oscillatory activation functions can improve gradient flow and reduce network size. Two theorems on limits of non-oscillatory activation functions are presented. A new oscillatory activation function called Growing Cosine Unit(GCU) defined as $C(z) = z\cos z$ that outperforms Sigmoids, Swish, Mish and ReLU on a variety of architectures and benchmarks is presented. The GCU activation has multiple zeros enabling single GCU neurons to have multiple hyperplanes in the decision boundary. This allows single GCU neurons to learn the XOR function without feature engineering. Experimental results indicate that replacing the activation function in the convolution layers with the GCU activation function significantly improves performance on CIFAR-10, CIFAR-100 and Imagenette.
    Bayesian inference via sparse Hamiltonian flows. (arXiv:2203.05723v2 [stat.ML] UPDATED)
    A Bayesian coreset is a small, weighted subset of data that replaces the full dataset during Bayesian inference, with the goal of reducing computational cost. Although past work has shown empirically that there often exists a coreset with low inferential error, efficiently constructing such a coreset remains a challenge. Current methods tend to be slow, require a secondary inference step after coreset construction, and do not provide bounds on the data marginal evidence. In this work, we introduce a new method -- sparse Hamiltonian flows -- that addresses all three of these challenges. The method involves first subsampling the data uniformly, and then optimizing a Hamiltonian flow parametrized by coreset weights and including periodic momentum quasi-refreshment steps. Theoretical results show that the method enables an exponential compression of the dataset in a representative model, and that the quasi-refreshment steps reduce the KL divergence to the target. Real and synthetic experiments demonstrate that sparse Hamiltonian flows provide accurate posterior approximations with significantly reduced runtime compared with competing dynamical-system-based inference methods.
    LB-SimTSC: An Efficient Similarity-Aware Graph Neural Network for Semi-Supervised Time Series Classification. (arXiv:2301.04838v1 [cs.LG])
    Time series classification is an important data mining task that has received a lot of interest in the past two decades. Due to the label scarcity in practice, semi-supervised time series classification with only a few labeled samples has become popular. Recently, Similarity-aware Time Series Classification (SimTSC) is proposed to address this problem by using a graph neural network classification model on the graph generated from pairwise Dynamic Time Warping (DTW) distance of batch data. It shows excellent accuracy and outperforms state-of-the-art deep learning models in several few-label settings. However, since SimTSC relies on pairwise DTW distances, the quadratic complexity of DTW limits its usability to only reasonably sized datasets. To address this challenge, we propose a new efficient semi-supervised time series classification technique, LB-SimTSC, with a new graph construction module. Instead of using DTW, we propose to utilize a lower bound of DTW, LB_Keogh, to approximate the dissimilarity between instances in linear time, while retaining the relative proximity relationships one would have obtained via computing DTW. We construct the pairwise distance matrix using LB_Keogh and build a graph for the graph neural network. We apply this approach to the ten largest datasets from the well-known UCR time series classification archive. The results demonstrate that this approach can be up to 104x faster than SimTSC when constructing the graph on large datasets without significantly decreasing classification accuracy.
    Learning to compile smartly for program size reduction. (arXiv:2301.05104v1 [cs.PL])
    Compiler optimization passes are an important tool for improving program efficiency and reducing program size, but manually selecting optimization passes can be time-consuming and error-prone. While human experts have identified a few fixed sequences of optimization passes (e.g., the Clang -Oz passes) that perform well for a wide variety of programs, these sequences are not conditioned on specific programs. In this paper, we propose a novel approach that learns a policy to select passes for program size reduction, allowing for customization and adaptation to specific programs. Our approach uses a search mechanism that helps identify useful pass sequences and a GNN with customized attention that selects the optimal sequence to use. Crucially it is able to generalize to new, unseen programs, making it more flexible and general than previous approaches. We evaluate our approach on a range of programs and show that it leads to size reduction compared to traditional optimization techniques. Our results demonstrate the potential of a single policy that is able to optimize many programs.
    Understanding Difficulty-based Sample Weighting with a Universal Difficulty Measure. (arXiv:2301.04850v1 [cs.LG])
    Sample weighting is widely used in deep learning. A large number of weighting methods essentially utilize the learning difficulty of training samples to calculate their weights. In this study, this scheme is called difficulty-based weighting. Two important issues arise when explaining this scheme. First, a unified difficulty measure that can be theoretically guaranteed for training samples does not exist. The learning difficulties of the samples are determined by multiple factors including noise level, imbalance degree, margin, and uncertainty. Nevertheless, existing measures only consider a single factor or in part, but not in their entirety. Second, a comprehensive theoretical explanation is lacking with respect to demonstrating why difficulty-based weighting schemes are effective in deep learning. In this study, we theoretically prove that the generalization error of a sample can be used as a universal difficulty measure. Furthermore, we provide formal theoretical justifications on the role of difficulty-based weighting for deep learning, consequently revealing its positive influences on both the optimization dynamics and generalization performance of deep models, which is instructive to existing weighting schemes.
    NOPA: Neurally-guided Online Probabilistic Assistance for Building Socially Intelligent Home Assistants. (arXiv:2301.05223v1 [cs.RO])
    In this work, we study how to build socially intelligent robots to assist people in their homes. In particular, we focus on assistance with online goal inference, where robots must simultaneously infer humans' goals and how to help them achieve those goals. Prior assistance methods either lack the adaptivity to adjust helping strategies (i.e., when and how to help) in response to uncertainty about goals or the scalability to conduct fast inference in a large goal space. Our NOPA (Neurally-guided Online Probabilistic Assistance) method addresses both of these challenges. NOPA consists of (1) an online goal inference module combining neural goal proposals with inverse planning and particle filtering for robust inference under uncertainty, and (2) a helping planner that discovers valuable subgoals to help with and is aware of the uncertainty in goal inference. We compare NOPA against multiple baselines in a new embodied AI assistance challenge: Online Watch-And-Help, in which a helper agent needs to simultaneously watch a main agent's action, infer its goal, and help perform a common household task faster in realistic virtual home environments. Experiments show that our helper agent robustly updates its goal inference and adapts its helping plans to the changing level of uncertainty.
    Exploration in Deep Reinforcement Learning: A Comprehensive Survey. (arXiv:2109.06668v5 [cs.AI] UPDATED)
    Deep Reinforcement Learning (DRL) and Deep Multi-agent Reinforcement Learning (MARL) have achieved significant successes across a wide range of domains, including game AI, autonomous vehicles, robotics, and so on. However, DRL and deep MARL agents are widely known to be sample inefficient that millions of interactions are usually needed even for relatively simple problem settings, thus preventing the wide application and deployment in real-industry scenarios. One bottleneck challenge behind is the well-known exploration problem, i.e., how efficiently exploring the environment and collecting informative experiences that could benefit policy learning towards the optimal ones. This problem becomes more challenging in complex environments with sparse rewards, noisy distractions, long horizons, and non-stationary co-learners. In this paper, we conduct a comprehensive survey on existing exploration methods for both single-agent and multi-agent RL. We start the survey by identifying several key challenges to efficient exploration. Beyond the above two main branches, we also include other notable exploration methods with different ideas and techniques. In addition to algorithmic analysis, we provide a comprehensive and unified empirical comparison of different exploration methods for DRL on a set of commonly used benchmarks. According to our algorithmic and empirical investigation, we finally summarize the open problems of exploration in DRL and deep MARL and point out a few future directions.
    Signed Directed Graph Contrastive Learning with Laplacian Augmentation. (arXiv:2301.05163v1 [cs.LG])
    Graph contrastive learning has become a powerful technique for several graph mining tasks. It learns discriminative representation from different perspectives of augmented graphs. Ubiquitous in our daily life, singed-directed graphs are the most complex and tricky to analyze among various graph types. That is why singed-directed graph contrastive learning has not been studied much yet, while there are many contrastive studies for unsigned and undirected. Thus, this paper proposes a novel signed-directed graph contrastive learning, SDGCL. It makes two different structurally perturbed graph views and gets node representations via magnetic Laplacian perturbation. We use a node-level contrastive loss to maximize the mutual information between the two graph views. The model is jointly learned with contrastive and supervised objectives. The graph encoder of SDGCL does not depend on social theories or predefined assumptions. Therefore it does not require finding triads or selecting neighbors to aggregate. It leverages only the edge signs and directions via magnetic Laplacian. To the best of our knowledge, it is the first to introduce magnetic Laplacian perturbation and signed spectral graph contrastive learning. The superiority of the proposed model is demonstrated through exhaustive experiments on four real-world datasets. SDGCL shows better performance than other state-of-the-art on four evaluation metrics.
    PiFold: Toward effective and efficient protein inverse folding. (arXiv:2209.12643v3 [cs.AI] UPDATED)
    How can we design protein sequences folding into the desired structures effectively and efficiently? Structure-based protein design has attracted increasing attention in recent years; however, few methods can simultaneously improve the accuracy and efficiency due to the lack of expressive features and autoregressive sequence decoder. To address these issues, we propose PiFold, which contains a novel residue featurizer and PiGNN layers to generate protein sequences in a one-shot way with improved recovery. Experiments show that PiFold could achieve 51.66\% recovery on CATH 4.2, while the inference speed is 70 times faster than the autoregressive competitors. In addition, PiFold achieves 58.72\% and 60.42\% recovery scores on TS50 and TS500, respectively. We conduct comprehensive ablation studies to reveal the role of different types of protein features and model designs, inspiring further simplification and improvement.
    Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study. (arXiv:2301.05174v1 [cs.IR])
    Most approaches to cross-modal retrieval (CMR) focus either on object-centric datasets, meaning that each document depicts or describes a single object, or on scene-centric datasets, meaning that each image depicts or describes a complex scene that involves multiple objects and relations between them. We posit that a robust CMR model should generalize well across both dataset types. Despite recent advances in CMR, the reproducibility of the results and their generalizability across different dataset types has not been studied before. We address this gap and focus on the reproducibility of the state-of-the-art CMR results when evaluated on object-centric and scene-centric datasets. We select two state-of-the-art CMR models with different architectures: (i) CLIP; and (ii) X-VLM. Additionally, we select two scene-centric datasets, and three object-centric datasets, and determine the relative performance of the selected models on these datasets. We focus on reproducibility, replicability, and generalizability of the outcomes of previously published CMR experiments. We discover that the experiments are not fully reproducible and replicable. Besides, the relative performance results partially generalize across object-centric and scene-centric datasets. On top of that, the scores obtained on object-centric datasets are much lower than the scores obtained on scene-centric datasets. For reproducibility and transparency we make our source code and the trained models publicly available.
    Statistical Learning with Sublinear Regret of Propagator Models. (arXiv:2301.05157v1 [q-fin.TR])
    We consider a class of learning problems in which an agent liquidates a risky asset while creating both transient price impact driven by an unknown convolution propagator and linear temporary price impact with an unknown parameter. We characterize the trader's performance as maximization of a revenue-risk functional, where the trader also exploits available information on a price predicting signal. We present a trading algorithm that alternates between exploration and exploitation phases and achieves sublinear regrets with high probability. For the exploration phase we propose a novel approach for non-parametric estimation of the price impact kernel by observing only the visible price process and derive sharp bounds on the convergence rate, which are characterised by the singularity of the propagator. These kernel estimation methods extend existing methods from the area of Tikhonov regularisation for inverse problems and are of independent interest. The bound on the regret in the exploitation phase is obtained by deriving stability results for the optimizer and value function of the associated class of infinite-dimensional stochastic control problems. As a complementary result we propose a regression-based algorithm to estimate the conditional expectation of non-Markovian signals and derive its convergence rate.
    Why is the State of Neural Network Pruning so Confusing? On the Fairness, Comparison Setup, and Trainability in Network Pruning. (arXiv:2301.05219v1 [cs.CV])
    The state of neural network pruning has been noticed to be unclear and even confusing for a while, largely due to "a lack of standardized benchmarks and metrics" [3]. To standardize benchmarks, first, we need to answer: what kind of comparison setup is considered fair? This basic yet crucial question has barely been clarified in the community, unfortunately. Meanwhile, we observe several papers have used (severely) sub-optimal hyper-parameters in pruning experiments, while the reason behind them is also elusive. These sub-optimal hyper-parameters further exacerbate the distorted benchmarks, rendering the state of neural network pruning even more obscure. Two mysteries in pruning represent such a confusing status: the performance-boosting effect of a larger finetuning learning rate, and the no-value argument of inheriting pretrained weights in filter pruning. In this work, we attempt to explain the confusing state of network pruning by demystifying the two mysteries. Specifically, (1) we first clarify the fairness principle in pruning experiments and summarize the widely-used comparison setups; (2) then we unveil the two pruning mysteries and point out the central role of network trainability, which has not been well recognized so far; (3) finally, we conclude the paper and give some concrete suggestions regarding how to calibrate the pruning benchmarks in the future. Code: https://github.com/mingsun-tse/why-the-state-of-pruning-so-confusing.
    A Cognitive Evaluation of Instruction Generation Agents tl;dr They Need Better Theory-of-Mind Capabilities. (arXiv:2301.05149v1 [cs.CL])
    We mathematically characterize the cognitive capabilities that enable humans to effectively guide others through natural language. We show that neural-network-based instruction generation agents possess similar cognitive capabilities, and design an evaluation scheme for probing those capabilities. Our results indicate that these agents, while capable of effectively narrowing the search space, poorly predict the listener's interpretations of their instructions and thus often fail to select the best instructions even from a small candidate set. We augment the agents with better theory-of-mind models of the listener and obtain significant performance boost in guiding real humans. Yet, there remains a considerable gap between our best agent and human guides. We discuss the challenges in closing this gap, emphasizing the need to construct better models of human behavior when interacting with AI-based agents.
    Fairly Private: Investigating The Fairness of Visual Privacy Preservation Algorithms. (arXiv:2301.05012v1 [cs.CV])
    As the privacy risks posed by camera surveillance and facial recognition have grown, so has the research into privacy preservation algorithms. Among these, visual privacy preservation algorithms attempt to impart bodily privacy to subjects in visuals by obfuscating privacy-sensitive areas. While disparate performances of facial recognition systems across phenotypes are the subject of much study, its counterpart, privacy preservation, is not commonly analysed from a fairness perspective. In this paper, the fairness of commonly used visual privacy preservation algorithms is investigated through the performances of facial recognition models on obfuscated images. Experiments on the PubFig dataset clearly show that the privacy protection provided is unequal across groups.
    Tracr: Compiled Transformers as a Laboratory for Interpretability. (arXiv:2301.05062v1 [cs.LG])
    Interpretability research aims to build tools for understanding machine learning (ML) models. However, such tools are inherently hard to evaluate because we do not have ground truth information about how ML models actually work. In this work, we propose to build transformer models manually as a testbed for interpretability research. We introduce Tracr, a "compiler" for translating human-readable programs into weights of a transformer model. Tracr takes code written in RASP, a domain-specific language (Weiss et al. 2021), and translates it into weights for a standard, decoder-only, GPT-like transformer architecture. We use Tracr to create a range of ground truth transformers that implement programs including computing token frequencies, sorting, and Dyck-n parenthesis checking, among others. To enable the broader research community to explore and use compiled models, we provide an open-source implementation of Tracr at https://github.com/deepmind/tracr.
    Choose, not Hoard: Information-to-Model Matching for Artificial Intelligence in O-RAN. (arXiv:2208.04229v2 [cs.NI] UPDATED)
    Open Radio Access Network (O-RAN) is an emerging paradigm, whereby virtualized network infrastructure elements from different vendors communicate via open, standardized interfaces. A key element therein is the RAN Intelligent Controller (RIC), an Artificial Intelligence (AI)-based controller. Traditionally, all data available in the network has been used to train a single AI model to be used at the RIC. This paper introduces, discusses, and evaluates the creation of multiple AI model instances at different RICs, leveraging information from some (or all) locations for their training. This brings about a flexible relationship between gNBs, the AI models used to control them, and the data such models are trained with. Experiments with real-world traces show how using multiple AI model instances that choose training data from specific locations improve the performance of traditional approaches following the hoarding strategy.
    Battery Degradation Long-term Forecast Using Gaussian Process Dynamical Models and Knowledge Transfer. (arXiv:2212.01609v2 [cs.LG] UPDATED)
    Batteries plays an essential role in modern energy ecosystem and are widely used in daily applications such as cell phones and electric vehicles. For many applications, the health status of batteries plays a critical role in the performance of the system by indicating efficient maintenance and on-time replacement. Directly modeling an individual battery using a computational models based on physical rules can be of low-efficiency, in terms of the difficulties in build such a model and the computational effort of tuning and running it especially on the edge. With the rapid development of sensor technology (to provide more insights into the system) and machine learning (to build capable yet fast model), it is now possible to directly build a data-riven model of the battery health status using the data collected from historical battery data (being possibly local and remote) to predict local battery health status in the future accurately. Nevertheless, most data-driven methods are trained based on the local battery data and lack the ability to extract common properties, such as generations and degradation, in the life span of other remote batteries. In this paper, we utilize a Gaussian process dynamical model (GPDM) to build a data-driven model of battery health status and propose a knowledge transfer method to extract common properties in the life span of all batteries to accurately predict the battery health status with and without features extracted from the local battery. For modern benchmark problems, the proposed method outperform the state-of-the-art methods with significant margins in terms of accuracy and is able to accuracy predict the regeneration process.
    Interaction models for remaining useful life estimation. (arXiv:2301.05029v1 [cs.LG])
    The paper deals with the problem of controlling the state of industrial devices according to the readings of their sensors. The current methods rely on one approach to feature extraction in which the prediction occurs. We proposed a technique to build a scalable model that combines multiple different feature extractor blocks. A new model based on sequential sensor space analysis achieves state-of-the-art results on the C-MAPSS benchmark for equipment remaining useful life estimation. The resulting model performance was validated including the prediction changes with scaling.
    Linking Neural Collapse and L2 Normalization with Improved Out-of-Distribution Detection in Deep Neural Networks. (arXiv:2209.08378v3 [cs.LG] UPDATED)
    We propose a simple modification to standard ResNet architectures--L2 normalization over feature space--that substantially improves out-of-distribution (OoD) performance on the previously proposed Deep Deterministic Uncertainty (DDU) benchmark. We show that this change also induces early Neural Collapse (NC), an effect linked to better OoD performance. Our method achieves comparable or superior OoD detection scores and classification accuracy in a small fraction of the training time of the benchmark. Additionally, it substantially improves worst case OoD performance over multiple, randomly initialized models. Though we do not suggest that NC is the sole mechanism or a comprehensive explanation for OoD behaviour in deep neural networks (DNN), we believe NC's simple mathematical and geometric structure can provide a framework for analysis of this complex phenomenon in future work.
    Improvement of Computational Performance of Evolutionary AutoML in a Heterogeneous Environment. (arXiv:2301.05102v1 [cs.LG])
    Resource-intensive computations are a major factor that limits the effectiveness of automated machine learning solutions. In the paper, we propose a modular approach that can be used to increase the quality of evolutionary optimization for modelling pipelines with a graph-based structure. It consists of several stages - parallelization, caching and evaluation. Heterogeneous and remote resources can be involved in the evaluation stage. The conducted experiments confirm the correctness and effectiveness of the proposed approach. The implemented algorithms are available as a part of the open-source framework FEDOT.
    Progress measures for grokking via mechanistic interpretability. (arXiv:2301.05217v1 [cs.LG])
    Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the amount of parameters, training data, or training steps. One approach to understanding emergence is to find continuous \textit{progress measures} that underlie the seemingly discontinuous qualitative changes. We argue that progress measures can be found via mechanistic interpretability: reverse-engineering learned behaviors into their individual components. As a case study, we investigate the recently-discovered phenomenon of ``grokking'' exhibited by small transformers trained on modular addition tasks. We fully reverse engineer the algorithm learned by these networks, which uses discrete Fourier transforms and trigonometric identities to convert addition to rotation about a circle. We confirm the algorithm by analyzing the activations and weights and by performing ablations in Fourier space. Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup. Our results show that grokking, rather than being a sudden shift, arises from the gradual amplification of structured mechanisms encoded in the weights, followed by the later removal of memorizing components.
    Recognition Models to Learn Dynamics from Partial Observations with Neural ODEs. (arXiv:2205.12550v3 [eess.SY] UPDATED)
    Identifying dynamical systems from experimental data is a notably difficult task. Prior knowledge generally helps, but the extent of this knowledge varies with the application, and customized models are often needed. Neural ordinary differential equations can be written as a flexible framework for system identification and can incorporate a broad spectrum of physical insight, giving physical interpretability to the resulting latent space. In the case of partial observations, however, the data points cannot directly be mapped to the latent state of the ODE. Hence, we propose to design recognition models, in particular inspired by nonlinear observer theory, to link the partial observations to the latent state. We demonstrate the performance of the proposed approach on numerical simulations and on an experimental dataset from a robotic exoskeleton.
    DROPO: Sim-to-Real Transfer with Offline Domain Randomization. (arXiv:2201.08434v2 [cs.RO] UPDATED)
    In recent years, domain randomization over dynamics parameters has gained a lot of traction as a method for sim-to-real transfer of reinforcement learning policies in robotic manipulation; however, finding optimal randomization distributions can be difficult. In this paper, we introduce DROPO, a novel method for estimating domain randomization distributions for safe sim-to-real transfer. Unlike prior work, DROPO only requires a limited, precollected offline dataset of trajectories, and explicitly models parameter uncertainty to match real data using a likelihood-based approach. We demonstrate that DROPO is capable of recovering dynamic parameter distributions in simulation and finding a distribution capable of compensating for an unmodeled phenomenon. We also evaluate the method in two zero-shot sim-to-real transfer scenarios, showing successful domain transfer and improved performance over prior methods.
    Estimate Deformation Capacity of Non-Ductile RC Shear Walls using Explainable Boosting Machine. (arXiv:2301.04652v1 [cs.LG])
    Machine learning is becoming increasingly prevalent for tackling challenges in earthquake engineering and providing fairly reliable and accurate predictions. However, it is mostly unclear how decisions are made because machine learning models are generally highly sophisticated, resulting in opaque black-box models. Machine learning models that are naturally interpretable and provide their own decision explanation, rather than using an explanatory, are more accurate in determining what the model actually computes. With this motivation, this study aims to develop a fully explainable machine learning model to predict the deformation capacity of non-ductile reinforced concrete shear walls based on experimental data collected worldwide. The proposed Explainable Boosting Machines (EBM)-based model is an interpretable, robust, naturally explainable glass-box model, yet provides high accuracy comparable to its black-box counterparts. The model enables the user to observe the relationship between the wall properties and the deformation capacity by quantifying the individual contribution of each wall property as well as the correlations among them. The mean coefficient of determination R2 and the mean ratio of predicted to actual value based on the test dataset are 0.92 and 1.05, respectively. The proposed predictive model stands out with its overall consistency with scientific knowledge, practicality, and interpretability without sacrificing high accuracy.
    Forgetful Active Learning with Switch Events: Efficient Sampling for Out-of-Distribution Data. (arXiv:2301.05106v1 [cs.LG])
    This paper considers deep out-of-distribution active learning. In practice, fully trained neural networks interact randomly with out-of-distribution (OOD) inputs and map aberrant samples randomly within the model representation space. Since data representations are direct manifestations of the training distribution, the data selection process plays a crucial role in outlier robustness. For paradigms such as active learning, this is especially challenging since protocols must not only improve performance on the training distribution most effectively but further render a robust representation space. However, existing strategies directly base the data selection on the data representation of the unlabeled data which is random for OOD samples by definition. For this purpose, we introduce forgetful active learning with switch events (FALSE) - a novel active learning protocol for out-of-distribution active learning. Instead of defining sample importance on the data representation directly, we formulate "informativeness" with learning difficulty during training. Specifically, we approximate how often the network "forgets" unlabeled samples and query the most "forgotten" samples for annotation. We report up to 4.5\% accuracy improvements in over 270 experiments, including four commonly used protocols, two OOD benchmarks, one in-distribution benchmark, and three different architectures.
    Masked Feature Prediction for Self-Supervised Visual Pre-Training. (arXiv:2112.09133v2 [cs.CV] UPDATED)
    We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions. We study five different types of features and find Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work using HOG for visual recognition. Our approach can learn abundant visual knowledge and drive large-scale Transformer-based models. Without using extra model weights or supervision, MaskFeat pre-trained on unlabeled videos achieves unprecedented results of 86.7% with MViT-L on Kinetics-400, 88.3% on Kinetics-600, 80.4% on Kinetics-700, 39.8 mAP on AVA, and 75.0% on SSv2. MaskFeat further generalizes to image input, which can be interpreted as a video with a single frame and obtains competitive results on ImageNet.
    Fast spline detection in high density microscopy data. (arXiv:2301.04460v1 [cs.CV] CROSS LISTED)
    Computer-aided analysis of biological microscopy data has seen a massive improvement with the utilization of general-purpose deep learning techniques. Yet, in microscopy studies of multi-organism systems, the problem of collision and overlap remains challenging. This is particularly true for systems composed of slender bodies such as crawling nematodes, swimming spermatozoa, or the beating of eukaryotic or prokaryotic flagella. Here, we develop a novel end-to-end deep learning approach to extract precise shape trajectories of generally motile and overlapping splines. Our method works in low resolution settings where feature keypoints are hard to define and detect. Detection is fast and we demonstrate the ability to track thousands of overlapping organisms simultaneously. While our approach is agnostic to area of application, we present it in the setting of and exemplify its usability on dense experiments of crawling Caenorhabditis elegans. The model training is achieved purely on synthetic data, utilizing a physics-based model for nematode motility, and we demonstrate the model's ability to generalize from simulations to experimental videos.
    Multi-Power Level $Q$-Learning Algorithm for Random Access in NOMA mMTC Systems. (arXiv:2301.05196v1 [cs.NI])
    The massive machine-type communications (mMTC) service will be part of new services planned to integrate the fifth generation of wireless communication (B5G). In mMTC, thousands of devices sporadically access available resource blocks on the network. In this scenario, the massive random access (RA) problem arises when two or more devices collide when selecting the same resource block. There are several techniques to deal with this problem. One of them deploys $Q$-learning (QL), in which devices store in their $Q$-table the rewards sent by the central node that indicate the quality of the transmission performed. The device learns the best resource blocks to select and transmit to avoid collisions. We propose a multi-power level QL (MPL-QL) algorithm that uses non-orthogonal multiple access (NOMA) transmit scheme to generate transmission power diversity and allow {accommodate} more than one device in the same time-slot as long as the signal-to-interference-plus-noise ratio (SINR) exceeds a threshold value. The numerical results reveal that the best performance-complexity trade-off is obtained by using a {higher {number of} power levels, typically eight levels}. The proposed MPL-QL {can deliver} better throughput and lower latency compared to other recent QL-based algorithms found in the literature
    ECSAS: Exploring Critical Scenarios from Action Sequence in Autonomous Driving. (arXiv:2209.10078v2 [cs.AI] UPDATED)
    Critical scenario generation requires the ability of sampling critical combinations from the infinite parameter space in the logic scenario. Existing solutions aim to explore the correlation of action parameters in the initial scenario rather than action sequences. How to model action sequences so that one can further consider the effects of different action parameters in the scenario is the bottleneck of the problem. In this paper, we attack the problem by proposing the ECSAS framework. Specifically, we first propose a description language, BTScenario, allowing us to model action sequences of the scenarios. We then use reinforcement learning to search for combinations of critical action parameters. To increase efficiency, we further propose several optimizations, including action masking and replay buffer. We have implemented ECSAS, and experimental results show that it is more efficient than native approaches such as random and combination testing in various nontrivial scenarios.
    Smart-Badge: A wearable badge with multi-modal sensors for kitchen activity recognition. (arXiv:2210.00888v2 [cs.LG] UPDATED)
    Human health is closely associated with their daily behavior and environment. However, keeping a healthy lifestyle is still challenging for most people as it is difficult to recognize their living behaviors and identify their surrounding situations to take appropriate action. Human activity recognition is a promising approach to building a behavior model of users, by which users can get feedback about their habits and be encouraged to develop a healthier lifestyle. In this paper, we present a smart light wearable badge with six kinds of sensors, including an infrared array sensor MLX90640 offering privacy-preserving, low-cost, and non-invasive features, to recognize daily activities in a realistic unmodified kitchen environment. A multi-channel convolutional neural network (MC-CNN) based on data and feature fusion methods is applied to classify 14 human activities associated with potentially unhealthy habits. Meanwhile, we evaluate the impact of the infrared array sensor on the recognition accuracy of these activities. We demonstrate the performance of the proposed work to detect the 14 activities performed by ten volunteers with an average accuracy of 92.44 % and an F1 score of 88.27 %.
    RAP: Risk-Aware Prediction for Robust Planning. (arXiv:2210.01368v2 [cs.LG] UPDATED)
    Robust planning in interactive scenarios requires predicting the uncertain future to make risk-aware decisions. Unfortunately, due to long-tail safety-critical events, the risk is often under-estimated by finite-sampling approximations of probabilistic motion forecasts. This can lead to overconfident and unsafe robot behavior, even with robust planners. Instead of assuming full prediction coverage that robust planners require, we propose to make prediction itself risk-aware. We introduce a new prediction objective to learn a risk-biased distribution over trajectories, so that risk evaluation simplifies to an expected cost estimation under this biased distribution. This reduces the sample complexity of the risk estimation during online planning, which is needed for safe real-time performance. Evaluation results in a didactic simulation environment and on a real-world dataset demonstrate the effectiveness of our approach. The code and a demo are available.
    Deep learning enhanced noise spectroscopy of a spin qubit environment. (arXiv:2301.05079v1 [quant-ph])
    The undesired interaction of a quantum system with its environment generally leads to a coherence decay of superposition states in time. A precise knowledge of the spectral content of the noise induced by the environment is crucial to protect qubit coherence and optimize its employment in quantum device applications. We experimentally show that the use of neural networks can highly increase the accuracy of noise spectroscopy, by reconstructing the power spectral density that characterizes an ensemble of carbon impurities around a nitrogen-vacancy (NV) center in diamond. Neural networks are trained over spin coherence functions of the NV center subjected to different Carr-Purcell sequences, typically used for dynamical decoupling (DD). As a result, we determine that deep learning models can be more accurate than standard DD noise-spectroscopy techniques, by requiring at the same time a much smaller number of DD sequences.
    Domain Expansion of Image Generators. (arXiv:2301.05225v1 [cs.CV])
    Can one inject new concepts into an already trained generative model, while respecting its existing structure and knowledge? We propose a new task - domain expansion - to address this. Given a pretrained generator and novel (but related) domains, we expand the generator to jointly model all domains, old and new, harmoniously. First, we note the generator contains a meaningful, pretrained latent space. Is it possible to minimally perturb this hard-earned representation, while maximally representing the new domains? Interestingly, we find that the latent space offers unused, "dormant" directions, which do not affect the output. This provides an opportunity: By "repurposing" these directions, we can represent new domains without perturbing the original representation. In fact, we find that pretrained generators have the capacity to add several - even hundreds - of new domains! Using our expansion method, one "expanded" model can supersede numerous domain-specific models, without expanding the model size. Additionally, a single expanded generator natively supports smooth transitions between domains, as well as composition of domains. Code and project page available at https://yotamnitzan.github.io/domain-expansion/.
    Benign Underfitting of Stochastic Gradient Descent. (arXiv:2202.13361v4 [cs.LG] UPDATED)
    We study to what extent may stochastic gradient descent (SGD) be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to training data. We consider the fundamental stochastic convex optimization framework, where (one pass, without-replacement) SGD is classically known to minimize the population risk at rate $O(1/\sqrt n)$, and prove that, surprisingly, there exist problem instances where the SGD solution exhibits both empirical risk and generalization gap of $\Omega(1)$. Consequently, it turns out that SGD is not algorithmically stable in any sense, and its generalization ability cannot be explained by uniform convergence or any other currently known generalization bound technique for that matter (other than that of its classical analysis). We then continue to analyze the closely related with-replacement SGD, for which we show that an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate. Finally, we interpret our main results in the context of without-replacement SGD for finite-sum convex optimization problems, and derive upper and lower bounds for the multi-epoch regime that significantly improve upon previously known results.
    Kinematic Evidence of an Embedded Protoplanet in HD 142666 Identified by Machine Learning. (arXiv:2301.05075v1 [astro-ph.EP])
    Observations of protoplanetary discs have shown that forming exoplanets leave characteristic imprints on the gas and dust of the disc. In the gas, these forming exoplanets cause deviations from Keplerian motion, which can be detected through molecular line observations. Our previous work has shown that machine learning can correctly determine if a planet is present in these discs. Using our machine learning models, we identify strong, localized non-Keplerian motion within the disc HD 142666. Subsequent hydrodynamics simulations of a system with a 5 Jupiter-mass planet at 75 au recreates the kinematic structure. By currently established standards in the field, we conclude that HD 142666 hosts a planet. This work represents a first step towards using machine learning to identify previously overlooked non-Keplerian features in protoplanetary discs.
    Causal Triplet: An Open Challenge for Intervention-centric Causal Representation Learning. (arXiv:2301.05169v1 [cs.LG])
    Recent years have seen a surge of interest in learning high-level causal representations from low-level image pairs under interventions. Yet, existing efforts are largely limited to simple synthetic settings that are far away from real-world problems. In this paper, we present Causal Triplet, a causal representation learning benchmark featuring not only visually more complex scenes, but also two crucial desiderata commonly overlooked in previous works: (i) an actionable counterfactual setting, where only certain object-level variables allow for counterfactual observations whereas others do not; (ii) an interventional downstream task with an emphasis on out-of-distribution robustness from the independent causal mechanisms principle. Through extensive experiments, we find that models built with the knowledge of disentangled or object-centric representations significantly outperform their distributed counterparts. However, recent causal representation learning methods still struggle to identify such latent structures, indicating substantial challenges and opportunities for future work. Our code and datasets will be available at https://sites.google.com/view/causaltriplet.
    Meta-Query-Net: Resolving Purity-Informativeness Dilemma in Open-set Active Learning. (arXiv:2210.07805v3 [cs.LG] UPDATED)
    Unlabeled data examples awaiting annotations contain open-set noise inevitably. A few active learning studies have attempted to deal with this open-set noise for sample selection by filtering out the noisy examples. However, because focusing on the purity of examples in a query set leads to overlooking the informativeness of the examples, the best balancing of purity and informativeness remains an important question. In this paper, to solve this purity-informativeness dilemma in open-set active learning, we propose a novel Meta-Query-Net,(MQ-Net) that adaptively finds the best balancing between the two factors. Specifically, by leveraging the multi-round property of active learning, we train MQ-Net using a query set without an additional validation set. Furthermore, a clear dominance relationship between unlabeled examples is effectively captured by MQ-Net through a novel skyline regularization. Extensive experiments on multiple open-set active learning scenarios demonstrate that the proposed MQ-Net achieves 20.14% improvement in terms of accuracy, compared with the state-of-the-art methods.
    SemPPL: Predicting pseudo-labels for better contrastive representations. (arXiv:2301.05158v1 [cs.CV])
    Learning from large amounts of unsupervised data and a small amount of supervision is an important open problem in computer vision. We propose a new semi-supervised learning method, Semantic Positives via Pseudo-Labels (SemPPL), that combines labelled and unlabelled data to learn informative representations. Our method extends self-supervised contrastive learning -- where representations are shaped by distinguishing whether two samples represent the same underlying datum (positives) or not (negatives) -- with a novel approach to selecting positives. To enrich the set of positives, we leverage the few existing ground-truth labels to predict the missing ones through a $k$-nearest neighbours classifier by using the learned embeddings of the labelled data. We thus extend the set of positives with datapoints having the same pseudo-label and call these semantic positives. We jointly learn the representation and predict bootstrapped pseudo-labels. This creates a reinforcing cycle. Strong initial representations enable better pseudo-label predictions which then improve the selection of semantic positives and lead to even better representations. SemPPL outperforms competing semi-supervised methods setting new state-of-the-art performance of $68.5\%$ and $76\%$ top-$1$ accuracy when using a ResNet-$50$ and training on $1\%$ and $10\%$ of labels on ImageNet, respectively. Furthermore, when using selective kernels, SemPPL significantly outperforms previous state-of-the-art achieving $72.3\%$ and $78.3\%$ top-$1$ accuracy on ImageNet with $1\%$ and $10\%$ labels, respectively, which improves absolute $+7.8\%$ and $+6.2\%$ over previous work. SemPPL also exhibits state-of-the-art performance over larger ResNet models as well as strong robustness, out-of-distribution and transfer performance.
    Explicit Context Integrated Recurrent Neural Network for Sensor Data Applications. (arXiv:2301.05031v1 [cs.LG])
    The development and progress in sensor, communication and computing technologies have led to data rich environments. In such environments, data can easily be acquired not only from the monitored entities but also from the surroundings where the entity is operating. The additional data that are available from the problem domain, which cannot be used independently for learning models, constitute context. Such context, if taken into account while learning, can potentially improve the performance of predictive models. Typically, the data from various sensors are present in the form of time series. Recurrent Neural Networks (RNNs) are preferred for such data as it can inherently handle temporal context. However, the conventional RNN models such as Elman RNN, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) in their present form do not provide any mechanism to integrate explicit contexts. In this paper, we propose a Context Integrated RNN (CiRNN) that enables integrating explicit contexts represented in the form of contextual features. In CiRNN, the network weights are influenced by contextual features in such a way that the primary input features which are more relevant to a given context are given more importance. To show the efficacy of CiRNN, we selected an application domain, engine health prognostics, which captures data from various sensors and where contextual information is available. We used the NASA Turbofan Engine Degradation Simulation dataset for estimating Remaining Useful Life (RUL) as it provides contextual information. We compared CiRNN with baseline models as well as the state-of-the-art methods. The experimental results show an improvement of 39% and 87% respectively, over state-of-the art models, when performance is measured with RMSE and score from an asymmetric scoring function. The latter measure is specific to the task of RUL estimation.
    Automated Sleep Staging via Parallel Frequency-Cut Attention. (arXiv:2204.03173v3 [cs.LG] UPDATED)
    This paper proposes a novel framework for automatically capturing the time-frequency nature of electroencephalogram (EEG) signals of human sleep based on the authoritative sleep medicine guidance. The framework consists of two parts: the first part extracts informative features by partitioning the input EEG spectrograms into a sequence of time-frequency patches. The second part is constituted by an attention-based architecture to efficiently search for the correlation between partitioned time-frequency patches and defining factors of sleep stages in parallel. The proposed pipeline is validated on the Sleep Heart Health Study dataset with new state-of-the-art results for the stages wake, N2, and N3, obtaining respective F1 scores of 0.93, 0.88, and 0.87, with only EEG signals used. The proposed method also has a high inter-rater reliability of 0.80 kappa. We also visualize the correspondence between sleep staging decisions and features extracted by the proposed method, providing strong interpretability for our model.
    Asynchronous training of quantum reinforcement learning. (arXiv:2301.05096v1 [quant-ph])
    The development of quantum machine learning (QML) has received a lot of interest recently thanks to developments in both quantum computing (QC) and machine learning (ML). One of the ML paradigms that can be utilized to address challenging sequential decision-making issues is reinforcement learning (RL). It has been demonstrated that classical RL can successfully complete many difficult tasks. A leading method of building quantum RL agents relies on the variational quantum circuits (VQC). However, training QRL algorithms with VQCs requires significant amount of computational resources. This issue hurdles the exploration of various QRL applications. In this paper, we approach this challenge through asynchronous training QRL agents. Specifically, we choose the asynchronous training of advantage actor-critic variational quantum policies. We demonstrate the results via numerical simulations that within the tasks considered, the asynchronous training of QRL agents can reach performance comparable to or superior than classical agents with similar model sizes and architectures.
    Thermal half-lives of azobenzene derivatives: virtual screening based on intersystem crossing using a machine learning potential. (arXiv:2207.11592v5 [physics.chem-ph] UPDATED)
    Molecular photoswitches are the foundation of light-activated drugs. A key photoswitch is azobenzene, which exhibits trans-cis isomerism in response to light. The thermal half-life of the cis isomer is of crucial importance, since it controls the duration of the light-induced biological effect. Here we introduce a computational tool for predicting the thermal half-lives of azobenzene derivatives. Our automated approach uses a fast and accurate machine learning potential trained on quantum chemistry data. Building on well-established earlier evidence, we argue that thermal isomerization proceeds through rotation mediated by intersystem crossing, and incorporate this mechanism into our automated workflow. We use our approach to predict the thermal half-lives of 19,000 azobenzene derivatives. We explore trends and tradeoffs between barriers and absorption wavelengths, and open-source our data and software to accelerate research in photopharmacology.
    An overview of open source Deep Learning-based libraries for Neuroscience. (arXiv:2301.05057v1 [q-bio.QM])
    In recent years, deep learning revolutionized machine learning and its applications, producing results comparable to human experts in several domains, including neuroscience. Each year, hundreds of scientific publications present applications of deep neural networks for biomedical data analysis. Due to the fast growth of the domain, it could be a complicated and extremely time-consuming task for worldwide researchers to have a clear perspective of the most recent and advanced software libraries. This work contributes to clarify the current situation in the domain, outlining the most useful libraries that implement and facilitate deep learning application to neuroscience, allowing scientists to identify the most suitable options for their research or clinical projects. This paper summarizes the main developments in Deep Learning and their relevance to Neuroscience; it then reviews neuroinformatic toolboxes and libraries, collected from the literature and from specific hubs of software projects oriented to neuroscience research. The selected tools are presented in tables detailing key features grouped by domain of application (e.g. data type, neuroscience area, task), model engineering (e.g. programming language, model customization) and technological aspect (e.g. interface, code source). The results show that, among a high number of available software tools, several libraries are standing out in terms of functionalities for neuroscience applications. The aggregation and discussion of this information can help the neuroscience community to devolop their research projects more efficiently and quickly, both by means of readily available tools, and by knowing which modules may be improved, connected or added.
    Toward Theoretical Guidance for Two Common Questions in Practical Cross-Validation based Hyperparameter Selection. (arXiv:2301.05131v1 [cs.LG])
    We show, to our knowledge, the first theoretical treatments of two common questions in cross-validation based hyperparameter selection: (1) After selecting the best hyperparameter using a held-out set, we train the final model using {\em all} of the training data -- since this may or may not improve future generalization error, should one do this? (2) During optimization such as via SGD (stochastic gradient descent), we must set the optimization tolerance $\rho$ -- since it trades off predictive accuracy with computation cost, how should one set it? Toward these problems, we introduce the {\em hold-in risk} (the error due to not using the whole training data), and the {\em model class mis-specification risk} (the error due to having chosen the wrong model class) in a theoretical view which is simple, general, and suggests heuristics that can be used when faced with a dataset instance. In proof-of-concept studies in synthetic data where theoretical quantities can be controlled, we show that these heuristics can, respectively, (1) always perform at least as well as always performing retraining or never performing retraining, (2) either improve performance or reduce computational overhead by $2\times$ with no loss in predictive performance.
    Efficient Ridge Solution for the Incremental Broad Learning System on Added Nodes by Inverse Cholesky Factorization of a Partitioned Matrix. (arXiv:1911.04872v4 [cs.LG] UPDATED)
    To accelerate the existing Broad Learning System (BLS) for new added nodes in [7], we extend the inverse Cholesky factorization in [10] to deduce an efficient inverse Cholesky factorization for a Hermitian matrix partitioned into 2 * 2 blocks, which is utilized to develop the proposed BLS algorithm 1. The proposed BLS algorithm 1 compute the ridge solution (i.e, the output weights) from the inverse Cholesky factor of the Hermitian matrix in the ridge inverse, and update the inverse Cholesky factor efficiently. From the proposed BLS algorithm 1, we deduce the proposed ridge inverse, which can be obtained from the generalized inverse in [7] by just change one matrix in the equation to compute the newly added sub-matrix. We also modify the proposed algorithm 1 into the proposed algorithm 2, which is equivalent to the existing BLS algorithm [7] in terms of numerical computations. The proposed algorithms 1 and 2 can reduce the computational complexity, since usually the Hermitian matrix in the ridge inverse is smaller than the ridge inverse. With respect to the existing BLS algorithm, the proposed algorithms 1 and 2 usually require about 13 and 2 3 of complexities, respectively, while in numerical experiments they achieve the speedups (in each additional training time) of 2.40 - 2.91 and 1.36 - 1.60, respectively. Numerical experiments also show that the proposed algorithm 1 and the standard ridge solution always bear the same testing accuracy, and usually so do the proposed algorithm 2 and the existing BLS algorithm. The existing BLS assumes the ridge parameter lamda->0, since it is based on the generalized inverse with the ridge regression approximation. When the assumption of lamda-> 0 is not satisfied, the standard ridge solution obviously achieves a better testing accuracy than the existing BLS algorithm in numerical experiments.
    A Stochastic Proximal Polyak Step Size. (arXiv:2301.04935v1 [math.OC])
    Recently, the stochastic Polyak step size (SPS) has emerged as a competitive adaptive step size scheme for stochastic gradient descent. Here we develop ProxSPS, a proximal variant of SPS that can handle regularization terms. Developing a proximal variant of SPS is particularly important, since SPS requires a lower bound of the objective function to work well. When the objective function is the sum of a loss and a regularizer, available estimates of a lower bound of the sum can be loose. In contrast, ProxSPS only requires a lower bound for the loss which is often readily available. As a consequence, we show that ProxSPS is easier to tune and more stable in the presence of regularization. Furthermore for image classification tasks, ProxSPS performs as well as AdamW with little to no tuning, and results in a network with smaller weight parameters. We also provide an extensive convergence analysis for ProxSPS that includes the non-smooth, smooth, weakly convex and strongly convex setting.
    Counterfactual Explanations for Concepts in $\mathcal{ELH}$. (arXiv:2301.05109v1 [cs.AI])
    Knowledge bases are widely used for information management on the web, enabling high-impact applications such as web search, question answering, and natural language processing. They also serve as the backbone for automatic decision systems, e.g. for medical diagnostics and credit scoring. As stakeholders affected by these decisions would like to understand their situation and verify fair decisions, a number of explanation approaches have been proposed using concepts in description logics. However, the learned concepts can become long and difficult to fathom for non-experts, even when verbalized. Moreover, long concepts do not immediately provide a clear path of action to change one's situation. Counterfactuals answering the question "How must feature values be changed to obtain a different classification?" have been proposed as short, human-friendly explanations for tabular data. In this paper, we transfer the notion of counterfactuals to description logics and propose the first algorithm for generating counterfactual explanations in the description logic $\mathcal{ELH}$. Counterfactual candidates are generated from concepts and the candidates with fewest feature changes are selected as counterfactuals. In case of multiple counterfactuals, we rank them according to the likeliness of their feature combinations. For evaluation, we conduct a user survey to investigate which of the generated counterfactual candidates are preferred for explanation by participants. In a second study, we explore possible use cases for counterfactual explanations.
    Diffusion-based Data Augmentation for Skin Disease Classification: Impact Across Original Medical Datasets to Fully Synthetic Images. (arXiv:2301.04802v1 [cs.LG])
    Despite continued advancement in recent years, deep neural networks still rely on large amounts of training data to avoid overfitting. However, labeled training data for real-world applications such as healthcare is limited and difficult to access given longstanding privacy, and strict data sharing policies. By manipulating image datasets in the pixel or feature space, existing data augmentation techniques represent one of the effective ways to improve the quantity and diversity of training data. Here, we look to advance augmentation techniques by building upon the emerging success of text-to-image diffusion probabilistic models in augmenting the training samples of our macroscopic skin disease dataset. We do so by enabling fine-grained control of the image generation process via input text prompts. We demonstrate that this generative data augmentation approach successfully maintains a similar classification accuracy of the visual classifier even when trained on a fully synthetic skin disease dataset. Similar to recent applications of generative models, our study suggests that diffusion models are indeed effective in generating high-quality skin images that do not sacrifice the classifier performance, and can improve the augmentation of training datasets after curation.  ( 2 min )
    Multimodal Deep Learning. (arXiv:2301.04856v1 [cs.CL])
    This book is the result of a seminar in which we reviewed multimodal approaches and attempted to create a solid overview of the field, starting with the current state-of-the-art approaches in the two subfields of Deep Learning individually. Further, modeling frameworks are discussed where one modality is transformed into the other, as well as models in which one modality is utilized to enhance representation learning for the other. To conclude the second part, architectures with a focus on handling both modalities simultaneously are introduced. Finally, we also cover other modalities as well as general-purpose multi-modal models, which are able to handle different tasks on different modalities within one unified architecture. One interesting application (Generative Art) eventually caps off this booklet.  ( 2 min )
    Sparse Coding in a Dual Memory System for Lifelong Learning. (arXiv:2301.05058v1 [cs.NE])
    Efficient continual learning in humans is enabled by a rich set of neurophysiological mechanisms and interactions between multiple memory systems. The brain efficiently encodes information in non-overlapping sparse codes, which facilitates the learning of new associations faster with controlled interference with previous associations. To mimic sparse coding in DNNs, we enforce activation sparsity along with a dropout mechanism which encourages the model to activate similar units for semantically similar inputs and have less overlap with activation patterns of semantically dissimilar inputs. This provides us with an efficient mechanism for balancing the reusability and interference of features, depending on the similarity of classes across tasks. Furthermore, we employ sparse coding in a multiple-memory replay mechanism. Our method maintains an additional long-term semantic memory that aggregates and consolidates information encoded in the synaptic weights of the working model. Our extensive evaluation and characteristics analysis show that equipped with these biologically inspired mechanisms, the model can further mitigate forgetting.  ( 2 min )
    Learning Partial Differential Equations by Spectral Approximates of General Sobolev Spaces. (arXiv:2301.04887v1 [math.NA])
    We introduce a novel spectral, finite-dimensional approximation of general Sobolev spaces in terms of Chebyshev polynomials. Based on this polynomial surrogate model (PSM), we realise a variational formulation, solving a vast class of linear and non-linear partial differential equations (PDEs). The PSMs are as flexible as the physics-informed neural nets (PINNs) and provide an alternative for addressing inverse PDE problems, such as PDE-parameter inference. In contrast to PINNs, the PSMs result in a convex optimisation problem for a vast class of PDEs, including all linear ones, in which case the PSM-approximate is efficiently computable due to the exponential convergence rate of the underlying variational gradient descent. As a practical consequence prominent PDE problems were resolved by the PSMs without High Performance Computing (HPC) on a local machine. This gain in efficiency is complemented by an increase of approximation power, outperforming PINN alternatives in both accuracy and runtime. Beyond the empirical evidence we give here, the translation of classic PDE theory in terms of the Sobolev space approximates suggests the PSMs to be universally applicable to well-posed, regular forward and inverse PDE problems.  ( 2 min )
    Low PAPR MIMO-OFDM Design Based on Convolutional Autoencoder. (arXiv:2301.05017v1 [eess.SP])
    An enhanced framework for peak-to-average power ratio ($\mathsf{PAPR}$) reduction and waveform design for Multiple-Input-Multiple-Output ($\mathsf{MIMO}$) orthogonal frequency-division multiplexing ($\mathsf{OFDM}$) systems, based on a convolutional-autoencoder ($\mathsf{CAE}$) architecture, is presented. The end-to-end learning-based autoencoder ($\mathsf{AE}$) for communication networks represents the network by an encoder and decoder, where in between, the learned latent representation goes through a physical communication channel. We introduce a joint learning scheme based on projected gradient descent iteration to optimize the spectral mask behavior and MIMO detection under the influence of a non-linear high power amplifier ($\mathsf{HPA}$) and a multipath fading channel. The offered efficient implementation novel waveform design technique utilizes only a single $\mathsf{PAPR}$ reduction block for all antennas. It is throughput-lossless, as no side information is required at the decoder. Performance is analyzed by examining the bit error rate ($\mathsf{BER}$), the $\mathsf{PAPR}$, and the spectral response and compared with classical $\mathsf{PAPR}$ reduction $\mathsf{MIMO}$ detector methods on 5G simulated data. The suggested system exhibits competitive performance when considering all optimization criteria simultaneously. We apply gradual loss learning for multi-objective optimization and show empirically that a single trained model covers the tasks of $\mathsf{PAPR}$ reduction, spectrum design, and $\mathsf{MIMO}$ detection together over a wide range of SNR levels.  ( 2 min )
    ChatGPT is not all you need. A State of the Art Review of large Generative AI models. (arXiv:2301.04655v1 [cs.LG])
    During the last two years there has been a plethora of large generative models such as ChatGPT or Stable Diffusion that have been published. Concretely, these models are able to perform tasks such as being a general question and answering system or automatically creating artistic images that are revolutionizing several sectors. Consequently, the implications that these generative models have in the industry and society are enormous, as several job positions may be transformed. For example, Generative AI is capable of transforming effectively and creatively texts to images, like the DALLE-2 model; text to 3D images, like the Dreamfusion model; images to text, like the Flamingo model; texts to video, like the Phenaki model; texts to audio, like the AudioLM model; texts to other texts, like ChatGPT; texts to code, like the Codex model; texts to scientific texts, like the Galactica model or even create algorithms like AlphaTensor. This work consists on an attempt to describe in a concise way the main models are sectors that are affected by generative AI and to provide a taxonomy of the main generative models published recently.  ( 2 min )
    Safe Policy Improvement for POMDPs via Finite-State Controllers. (arXiv:2301.04939v1 [cs.AI])
    We study safe policy improvement (SPI) for partially observable Markov decision processes (POMDPs). SPI is an offline reinforcement learning (RL) problem that assumes access to (1) historical data about an environment, and (2) the so-called behavior policy that previously generated this data by interacting with the environment. SPI methods neither require access to a model nor the environment itself, and aim to reliably improve the behavior policy in an offline manner. Existing methods make the strong assumption that the environment is fully observable. In our novel approach to the SPI problem for POMDPs, we assume that a finite-state controller (FSC) represents the behavior policy and that finite memory is sufficient to derive optimal policies. This assumption allows us to map the POMDP to a finite-state fully observable MDP, the history MDP. We estimate this MDP by combining the historical data and the memory of the FSC, and compute an improved policy using an off-the-shelf SPI algorithm. The underlying SPI method constrains the policy-space according to the available data, such that the newly computed policy only differs from the behavior policy when sufficient data was available. We show that this new policy, converted into a new FSC for the (unknown) POMDP, outperforms the behavior policy with high probability. Experimental results on several well-established benchmarks show the applicability of the approach, even in cases where finite memory is not sufficient.  ( 2 min )
    Thompson Sampling with Diffusion Generative Prior. (arXiv:2301.05182v1 [cs.LG])
    In this work, we initiate the idea of using denoising diffusion models to learn priors for online decision making problems. Our special focus is on the meta-learning for bandit framework, with the goal of learning a strategy that performs well across bandit tasks of a same class. To this end, we train a diffusion model that learns the underlying task distribution and combine Thompson sampling with the learned prior to deal with new tasks at test time. Our posterior sampling algorithm is designed to carefully balance between the learned prior and the noisy observations that come from the learner's interaction with the environment. To capture realistic bandit scenarios, we also propose a novel diffusion model training procedure that trains even from incomplete and/or noisy data, which could be of independent interest. Finally, our extensive experimental evaluations clearly demonstrate the potential of the proposed approach.  ( 2 min )
    Manifold Fitting under Unbounded Noise. (arXiv:1909.10228v2 [stat.ML] UPDATED)
    There has been an emerging trend in non-Euclidean statistical analysis of aiming to recover a low dimensional structure, namely a manifold, underlying the high dimensional data. Recovering the manifold requires the noise to be of certain concentration. Existing methods address this problem by constructing an approximated manifold based on the tangent space estimation at each sample point. Although theoretical convergence for these methods is guaranteed, either the samples are noiseless or the noise is bounded. However, if the noise is unbounded, which is a common scenario, the tangent space estimation at the noisy samples will be blurred. Fitting a manifold from the blurred tangent space might increase the inaccuracy. In this paper, we introduce a new manifold-fitting method, by which the output manifold is constructed by directly estimating the tangent spaces at the projected points on the underlying manifold, rather than at the sample points, to decrease the error caused by the noise. Assuming the noise is unbounded, our new method provides theoretical convergence in high probability, in terms of the upper bound of the distance between the estimated and underlying manifold. The smoothness of the estimated manifold is also evaluated by bounding the supremum of twice difference above. Numerical simulations are provided to validate our theoretical findings and demonstrate the advantages of our method over other relevant manifold fitting methods. Finally, our method is applied to real data examples.  ( 2 min )
    ViTs for SITS: Vision Transformers for Satellite Image Time Series. (arXiv:2301.04944v1 [cs.CV])
    In this paper we introduce the Temporo-Spatial Vision Transformer (TSViT), a fully-attentional model for general Satellite Image Time Series (SITS) processing based on the Vision Transformer (ViT). TSViT splits a SITS record into non-overlapping patches in space and time which are tokenized and subsequently processed by a factorized temporo-spatial encoder. We argue, that in contrast to natural images, a temporal-then-spatial factorization is more intuitive for SITS processing and present experimental evidence for this claim. Additionally, we enhance the model's discriminative power by introducing two novel mechanisms for acquisition-time-specific temporal positional encodings and multiple learnable class tokens. The effect of all novel design choices is evaluated through an extensive ablation study. Our proposed architecture achieves state-of-the-art performance, surpassing previous approaches by a significant margin in three publicly available SITS semantic segmentation and classification datasets. All model, training and evaluation codes are made publicly available to facilitate further research.  ( 2 min )
    Improving Inference Performance of Machine Learning with the Divide-and-Conquer Principle. (arXiv:2301.05099v1 [cs.LG])
    Many popular machine learning models scale poorly when deployed on CPUs. In this paper we explore the reasons why and propose a simple, yet effective approach based on the well-known Divide-and-Conquer Principle to tackle this problem of great practical importance. Given an inference job, instead of using all available computing resources (i.e., CPU cores) for running it, the idea is to break the job into independent parts that can be executed in parallel, each with the number of cores according to its expected computational cost. We implement this idea in the popular OnnxRuntime framework and evaluate its effectiveness with several use cases, including the well-known models for optical character recognition (PaddleOCR) and natural language processing (BERT).  ( 2 min )
    Phase-shifted Adversarial Training. (arXiv:2301.04785v1 [cs.LG])
    Adversarial training has been considered an imperative component for safely deploying neural network-based applications to the real world. To achieve stronger robustness, existing methods primarily focus on how to generate strong attacks by increasing the number of update steps, regularizing the models with the smoothed loss function, and injecting the randomness into the attack. Instead, we analyze the behavior of adversarial training through the lens of response frequency. We empirically discover that adversarial training causes neural networks to have low convergence to high-frequency information, resulting in highly oscillated predictions near each data. To learn high-frequency contents efficiently and effectively, we first prove that a universal phenomenon of frequency principle, i.e., \textit{lower frequencies are learned first}, still holds in adversarial training. Based on that, we propose phase-shifted adversarial training (PhaseAT) in which the model learns high-frequency components by shifting these frequencies to the low-frequency range where the fast convergence occurs. For evaluations, we conduct the experiments on CIFAR-10 and ImageNet with the adaptive attack carefully designed for reliable evaluation. Comprehensive results show that PhaseAT significantly improves the convergence for high-frequency information. This results in improved adversarial robustness by enabling the model to have smoothed predictions near each data.  ( 2 min )
    Private estimation algorithms for stochastic block models and mixture models. (arXiv:2301.04822v1 [cs.DS])
    We introduce general tools for designing efficient private estimation algorithms, in the high-dimensional settings, whose statistical guarantees almost match those of the best known non-private algorithms. To illustrate our techniques, we consider two problems: recovery of stochastic block models and learning mixtures of spherical Gaussians. For the former, we present the first efficient $(\epsilon, \delta)$-differentially private algorithm for both weak recovery and exact recovery. Previously known algorithms achieving comparable guarantees required quasi-polynomial time. For the latter, we design an $(\epsilon, \delta)$-differentially private algorithm that recovers the centers of the $k$-mixture when the minimum separation is at least $ O(k^{1/t}\sqrt{t})$. For all choices of $t$, this algorithm requires sample complexity $n\geq k^{O(1)}d^{O(t)}$ and time complexity $(nd)^{O(t)}$. Prior work required minimum separation at least $O(\sqrt{k})$ as well as an explicit upper bound on the Euclidean norm of the centers.  ( 2 min )
    Machine learning methods for prediction of breakthrough curves in reactive porous media. (arXiv:2301.04998v1 [physics.flu-dyn])
    Reactive flows in porous media play an important role in our life and are crucial for many industrial, environmental and biomedical applications. Very often the concentration of the species at the inlet is known, and the so-called breakthrough curves, measured at the outlet, are the quantities which could be measured or computed numerically. The measurements and the simulations could be time-consuming and expensive, and machine learning and Big Data approaches can help to predict breakthrough curves at lower costs. Machine learning (ML) methods, such as Gaussian processes and fully-connected neural networks, and a tensor method, cross approximation, are well suited for predicting breakthrough curves. In this paper, we demonstrate their performance in the case of pore scale reactive flow in catalytic filters.  ( 2 min )
    Variational Inference: Posterior Threshold Improves Network Clustering Accuracy in Sparse Regimes. (arXiv:2301.04771v1 [stat.ML])
    Variational inference has been widely used in machine learning literature to fit various Bayesian models. In network analysis, this method has been successfully applied to solve the community detection problems. Although these results are promising, their theoretical support is only for relatively dense networks, an assumption that may not hold for real networks. In addition, it has been shown recently that the variational loss surface has many saddle points, which may severely affect its performance, especially when applied to sparse networks. This paper proposes a simple way to improve the variational inference method by hard thresholding the posterior of the community assignment after each iteration. Using a random initialization that correlates with the true community assignment, we show that the proposed method converges and can accurately recover the true community labels, even when the average node degree of the network is bounded. Extensive numerical study further confirms the advantage of the proposed method over the classical variational inference and another state-of-the-art algorithm.  ( 2 min )
    Self-Attention Amortized Distributional Projection Optimization for Sliced Wasserstein Point-Cloud Reconstruction. (arXiv:2301.04791v1 [stat.ML])
    Max sliced Wasserstein (Max-SW) distance has been widely known as a solution for redundant projections of sliced Wasserstein (SW) distance. In applications that have various independent pairs of probability measures, amortized projection optimization is utilized to predict the ``max" projecting directions given two input measures instead of using projected gradient ascent multiple times. Despite being efficient, the first issue of the current framework is the violation of permutation invariance property and symmetry property. To address the issue, we propose to design amortized models based on self-attention architecture. Moreover, we adopt efficient self-attention architectures to make the computation linear in the number of supports. Secondly, Max-SW and its amortized version cannot guarantee metricity property due to the sub-optimality of the projected gradient ascent and the amortization gap. Therefore, we propose to replace Max-SW with distributional sliced Wasserstein distance with von Mises-Fisher (vMF) projecting distribution (v-DSW). Since v-DSW is a metric with any non-degenerate vMF distribution, its amortized version can guarantee the metricity when predicting the best discriminate projecting distribution. With the two improvements, we derive self-attention amortized distributional projection optimization and show its appealing performance in point-cloud reconstruction and its downstream applications.  ( 2 min )
    Inverse Quantum Fourier Transform Inspired Algorithm for Unsupervised Image Segmentation. (arXiv:2301.04705v1 [cs.CV])
    Image segmentation is a very popular and important task in computer vision. In this paper, inverse quantum Fourier transform (IQFT) for image segmentation has been explored and a novel IQFT-inspired algorithm is proposed and implemented by leveraging the underlying mathematical structure of the IQFT. Specifically, the proposed method takes advantage of the phase information of the pixels in the image by encoding the pixels' intensity into qubit relative phases and applying IQFT to classify the pixels into different segments automatically and efficiently. To the best of our knowledge, this is the first attempt of using IQFT for unsupervised image segmentation. The proposed method has low computational cost comparing to the deep learning-based methods and more importantly it does not require training, thus make it suitable for real-time applications. The performance of the proposed method is compared with K-means and Otsu-thresholding. The proposed method outperforms both of them on the PASCAL VOC 2012 segmentation benchmark and the xVIEW2 challenge dataset by as much as 50% in terms of mean Intersection-Over-Union (mIOU).  ( 2 min )
    KAER: A Knowledge Augmented Pre-Trained Language Model for Entity Resolution. (arXiv:2301.04770v1 [cs.CL])
    Entity resolution has been an essential and well-studied task in data cleaning research for decades. Existing work has discussed the feasibility of utilizing pre-trained language models to perform entity resolution and achieved promising results. However, few works have discussed injecting domain knowledge to improve the performance of pre-trained language models on entity resolution tasks. In this study, we propose Knowledge Augmented Entity Resolution (KAER), a novel framework named for augmenting pre-trained language models with external knowledge for entity resolution. We discuss the results of utilizing different knowledge augmentation and prompting methods to improve entity resolution performance. Our model improves on Ditto, the existing state-of-the-art entity resolution method. In particular, 1) KAER performs more robustly and achieves better results on "dirty data", and 2) with more general knowledge injection, KAER outperforms the existing baseline models on the textual dataset and dataset from the online product domain. 3) KAER achieves competitive results on highly domain-specific datasets, such as citation datasets, requiring the injection of expert knowledge in future work.  ( 2 min )
    Federated Transfer-Ordered-Personalized Learning for Driver Monitoring Application. (arXiv:2301.04829v1 [cs.LG])
    Federated learning (FL) shines through in the internet of things (IoT) with its ability to realize collaborative learning and improve learning efficiency by sharing client model parameters trained on local data. Although FL has been successfully applied to various domains, including driver monitoring application (DMA) on the internet of vehicles (IoV), its usages still face some open issues, such as data and system heterogeneity, large-scale parallelism communication resources, malicious attacks, and data poisoning. This paper proposes a federated transfer-ordered-personalized learning (FedTOP) framework to address the above problems and test on two real-world datasets with and without system heterogeneity. The performance of the three extensions, transfer, ordered, and personalized, is compared by an ablation study and achieves 92.32% and 95.96% accuracy on the test clients of two datasets, respectively. Compared to the baseline, there is a 462% improvement in accuracy and a 37.46% reduction in communication resource consumption. The results demonstrate that the proposed FedTOP can be used as a highly accurate, streamlined, privacy-preserving, cybersecurity-oriented, personalized framework for DMA.  ( 2 min )
    The Berkelmans-Pries Feature Importance Method: A Generic Measure of Informativeness of Features. (arXiv:2301.04740v1 [cs.LG])
    Over the past few years, the use of machine learning models has emerged as a generic and powerful means for prediction purposes. At the same time, there is a growing demand for interpretability of prediction models. To determine which features of a dataset are important to predict a target variable $Y$, a Feature Importance (FI) method can be used. By quantifying how important each feature is for predicting $Y$, irrelevant features can be identified and removed, which could increase the speed and accuracy of a model, and moreover, important features can be discovered, which could lead to valuable insights. A major problem with evaluating FI methods, is that the ground truth FI is often unknown. As a consequence, existing FI methods do not give the exact correct FI values. This is one of the many reasons why it can be hard to properly interpret the results of an FI method. Motivated by this, we introduce a new global approach named the Berkelmans-Pries FI method, which is based on a combination of Shapley values and the Berkelmans-Pries dependency function. We prove that our method has many useful properties, and accurately predicts the correct FI values for several cases where the ground truth FI can be derived in an exact manner. We experimentally show for a large collection of FI methods (468) that existing methods do not have the same useful properties. This shows that the Berkelmans-Pries FI method is a highly valuable tool for analyzing datasets with complex interdependencies.  ( 2 min )
    Switchable Lightweight Anti-symmetric Processing (SLAP) with CNN to Reduce Sample Size and Speed up Learning -- Application in Gomoku Reinforcement Learning. (arXiv:2301.04746v1 [cs.LG])
    To replace data augmentation, this paper proposed a method called SLAP to intensify experience to speed up machine learning and reduce the sample size. SLAP is a model-independent protocol/function to produce the same output given different transformation variants. SLAP improved the convergence speed of convolutional neural network learning by 83% in the experiments with Gomoku game states, with only one eighth of the sample size compared with data augmentation. In reinforcement learning for Gomoku, using AlphaGo Zero/AlphaZero algorithm with data augmentation as baseline, SLAP reduced the number of training samples by a factor of 8 and achieved similar winning rate against the same evaluator, but it was not yet evident that it could speed up reinforcement learning. The benefits should at least apply to domains that are invariant to symmetry or certain transformations. As future work, SLAP may aid more explainable learning and transfer learning for domains that are not invariant to symmetry, as a small step towards artificial general intelligence.  ( 2 min )
    NarrowBERT: Accelerating Masked Language Model Pretraining and Inference. (arXiv:2301.04761v1 [cs.CL])
    Large-scale language model pretraining is a very successful form of self-supervised learning in natural language processing, but it is increasingly expensive to perform as the models and pretraining corpora have become larger over time. We propose NarrowBERT, a modified transformer encoder that increases the throughput for masked language model pretraining by more than $2\times$. NarrowBERT sparsifies the transformer model such that the self-attention queries and feedforward layers only operate on the masked tokens of each sentence during pretraining, rather than all of the tokens as with the usual transformer encoder. We also show that NarrowBERT increases the throughput at inference time by as much as $3.5\times$ with minimal (or no) performance degradation on sentence encoding tasks like MNLI. Finally, we examine the performance of NarrowBERT on the IMDB and Amazon reviews classification and CoNLL NER tasks and show that it is also comparable to standard BERT performance.  ( 2 min )
    We are Going to the Space -- Part 1: Which device to deploy in a satellite?. (arXiv:2301.04954v1 [cs.LG])
    The shrinkage in sizes of components that make up satellites led to wider and low cost availability of satellites. As a result, there has been an advent of smaller organizations having the ability to deploy satellites with a variety of data-intensive applications to run on them. One popular application is image analysis to detect, for example, land, ice, clouds, etc. However, the resource-constrained nature of the devices deployed in satellites creates additional challenges for this resource-intensive application. In this paper, we investigate the performance of a variety of edge devices for deep-learning-based image processing in space. Our goal is to determine the devices that satisfy the latency and power constraints of satellites while achieving reasonably accurate results. Our results demonstrate that hardware accelerators (TPUs, GPUs) are necessary to reach the latency requirements. On the other hand, state-of-the-art edge devices with GPUs could have a high power draw, making them unsuitable for deployment on a satellite.  ( 2 min )
    Unsupervised Driving Event Discovery Based on Vehicle CAN-data. (arXiv:2301.04988v1 [cs.LG])
    The data collected from a vehicle's Controller Area Network (CAN) can quickly exceed human analysis or annotation capabilities when considering fleets of vehicles, which stresses the importance of unsupervised machine learning methods. This work presents a simultaneous clustering and segmentation approach for vehicle CAN-data that identifies common driving events in an unsupervised manner. The approach builds on self-supervised learning (SSL) for multivariate time series to distinguish different driving events in the learned latent space. We evaluate our approach with a dataset of real Tesla Model 3 vehicle CAN-data and a two-hour driving session that we annotated with different driving events. With our approach, we evaluate the applicability of recent time series-related contrastive and generative SSL techniques to learn representations that distinguish driving events. Compared to state-of-the-art (SOTA) generative SSL methods for driving event discovery, we find that contrastive learning approaches reach similar performance.  ( 2 min )
    Online Hyperparameter Optimization for Class-Incremental Learning. (arXiv:2301.05032v1 [cs.LG])
    Class-incremental learning (CIL) aims to train a classification model while the number of classes increases phase-by-phase. An inherent challenge of CIL is the stability-plasticity tradeoff, i.e., CIL models should keep stable to retain old knowledge and keep plastic to absorb new knowledge. However, none of the existing CIL models can achieve the optimal tradeoff in different data-receiving settings--where typically the training-from-half (TFH) setting needs more stability, but the training-from-scratch (TFS) needs more plasticity. To this end, we design an online learning method that can adaptively optimize the tradeoff without knowing the setting as a priori. Specifically, we first introduce the key hyperparameters that influence the trade-off, e.g., knowledge distillation (KD) loss weights, learning rates, and classifier types. Then, we formulate the hyperparameter optimization process as an online Markov Decision Process (MDP) problem and propose a specific algorithm to solve it. We apply local estimated rewards and a classic bandit algorithm Exp3 [4] to address the issues when applying online MDP methods to the CIL protocol. Our method consistently improves top-performing CIL methods in both TFH and TFS settings, e.g., boosting the average accuracy of TFH and TFS by 2.2 percentage points on ImageNet-Full, compared to the state-of-the-art [23].  ( 2 min )
    LiteLSTM Architecture Based on Weights Sharing for Recurrent Neural Networks. (arXiv:2301.04794v1 [cs.LG])
    Long short-term memory (LSTM) is one of the robust recurrent neural network architectures for learning sequential data. However, it requires considerable computational power to learn and implement both software and hardware aspects. This paper proposed a novel LiteLSTM architecture based on reducing the LSTM computation components via the weights sharing concept to reduce the overall architecture computation cost and maintain the architecture performance. The proposed LiteLSTM can be significant for processing large data where time-consuming is crucial while hardware resources are limited, such as the security of IoT devices and medical data processing. The proposed model was evaluated and tested empirically on three different datasets from the computer vision, cybersecurity, speech emotion recognition domains. The proposed LiteLSTM has comparable accuracy to the other state-of-the-art recurrent architecture while using a smaller computation budget.  ( 2 min )
    Graph Laplacian for Semi-Supervised Learning. (arXiv:2301.04956v1 [cs.CV])
    Semi-supervised learning is highly useful in common scenarios where labeled data is scarce but unlabeled data is abundant. The graph (or nonlocal) Laplacian is a fundamental smoothing operator for solving various learning tasks. For unsupervised clustering, a spectral embedding is often used, based on graph-Laplacian eigenvectors. For semi-supervised problems, the common approach is to solve a constrained optimization problem, regularized by a Dirichlet energy, based on the graph-Laplacian. However, as supervision decreases, Dirichlet optimization becomes suboptimal. We therefore would like to obtain a smooth transition between unsupervised clustering and low-supervised graph-based classification. In this paper, we propose a new type of graph-Laplacian which is adapted for Semi-Supervised Learning (SSL) problems. It is based on both density and contrastive measures and allows the encoding of the labeled data directly in the operator. Thus, we can perform successfully semi-supervised learning using spectral clustering. The benefits of our approach are illustrated for several SSL problems.  ( 2 min )
    Universality of neural dynamics on complex networks. (arXiv:2301.04900v1 [cond-mat.stat-mech])
    This paper discusses the capacity of graph neural networks to learn the functional form of ordinary differential equations that govern dynamics on complex networks. We propose necessary elements for such a problem, namely, inductive biases, a neural network architecture and a learning task. Statistical learning theory suggests that generalisation power of neural networks relies on independence and identical distribution (i.i.d.)\ of training and testing data. Although this assumption together with an appropriate neural architecture and a learning mechanism is sufficient for accurate out-of-sample predictions of dynamics such as, e.g.\ mass-action kinetics, by studying the out-of-distribution generalisation in the case of diffusion dynamics, we find that the neural network model: (i) has a generalisation capacity that depends on the first moment of the initial value data distribution; (ii) learns the non-dissipative nature of dynamics implicitly; and (iii) the model's accuracy resolution limit is of order $\mathcal{O}(1/\sqrt{n})$ for a system of size $n$.  ( 2 min )
    Analyzing Inexact Hypergradients for Bilevel Learning. (arXiv:2301.04764v1 [math.OC])
    Estimating hyperparameters has been a long-standing problem in machine learning. We consider the case where the task at hand is modeled as the solution to an optimization problem. Here the exact gradient with respect to the hyperparameters cannot be feasibly computed and approximate strategies are required. We introduce a unified framework for computing hypergradients that generalizes existing methods based on the implicit function theorem and automatic differentiation/backpropagation, showing that these two seemingly disparate approaches are actually tightly connected. Our framework is extremely flexible, allowing its subproblems to be solved with any suitable method, to any degree of accuracy. We derive a priori and computable a posteriori error bounds for all our methods, and numerically show that our a posteriori bounds are usually more accurate. Our numerical results also show that, surprisingly, for efficient bilevel optimization, the choice of hypergradient algorithm is at least as important as the choice of lower-level solver.  ( 2 min )
    Open SESAME: Fighting Botnets with Seed Reconstructions of Domain Generation Algorithms. (arXiv:2301.05048v1 [cs.CR])
    An important aspect of many botnets is their capability to generate pseudorandom domain names using Domain Generation Algorithms (DGAs). A cyber criminal can register such domains to establish periodically changing rendezvous points with the bots. DGAs make use of seeds to generate sets of domains. Seeds can easily be changed in order to generate entirely new groups of domains while using the same underlying algorithm. While this requires very little manual effort for an adversary, security specialists typically have to manually reverse engineer new malware strains to reconstruct the seeds. Only when the seed and DGA are known, past and future domains can be generated, efficiently attributed, blocked, sinkholed or used for a take-down. Common counters in the literature consist of databases or Machine Learning (ML) based detectors to keep track of past and future domains of known DGAs and to identify DGA-generated domain names, respectively. However, database based approaches can not detect domains generated by new DGAs, and ML approaches can not generate future domain names. In this paper, we introduce SESAME, a system that combines the two above-mentioned approaches and contains a module for automatic Seed Reconstruction, which is, to our knowledge, the first of its kind. It is used to automatically classify domain names, rate their novelty, and determine the seeds of the underlying DGAs. SESAME consists of multiple DGA-specific Seed Reconstructors and is designed to work purely based on domain names, as they are easily obtainable from observing the network traffic. We evaluated our approach on 20.8 gigabytes of DNS-lookups. Thereby, we identified 17 DGAs, of which 4 were entirely new to us.  ( 2 min )
    Efficient Preference-Based Reinforcement Learning Using Learned Dynamics Models. (arXiv:2301.04741v1 [cs.LG])
    Preference-based reinforcement learning (PbRL) can enable robots to learn to perform tasks based on an individual's preferences without requiring a hand-crafted reward function. However, existing approaches either assume access to a high-fidelity simulator or analytic model or take a model-free approach that requires extensive, possibly unsafe online environment interactions. In this paper, we study the benefits and challenges of using a learned dynamics model when performing PbRL. In particular, we provide evidence that a learned dynamics model offers the following benefits when performing PbRL: (1) preference elicitation and policy optimization require significantly fewer environment interactions than model-free PbRL, (2) diverse preference queries can be synthesized safely and efficiently as a byproduct of standard model-based RL, and (3) reward pre-training based on suboptimal demonstrations can be performed without any environmental interaction. Our paper provides empirical evidence that learned dynamics models enable robots to learn customized policies based on user preferences in ways that are safer and more sample efficient than prior preference learning approaches.  ( 2 min )
    SensePOLAR: Word sense aware interpretability for pre-trained contextual word embeddings. (arXiv:2301.04704v1 [cs.CL])
    Adding interpretability to word embeddings represents an area of active research in text representation. Recent work has explored thepotential of embedding words via so-called polar dimensions (e.g. good vs. bad, correct vs. wrong). Examples of such recent approaches include SemAxis, POLAR, FrameAxis, and BiImp. Although these approaches provide interpretable dimensions for words, they have not been designed to deal with polysemy, i.e. they can not easily distinguish between different senses of words. To address this limitation, we present SensePOLAR, an extension of the original POLAR framework that enables word-sense aware interpretability for pre-trained contextual word embeddings. The resulting interpretable word embeddings achieve a level of performance that is comparable to original contextual word embeddings across a variety of natural language processing tasks including the GLUE and SQuAD benchmarks. Our work removes a fundamental limitation of existing approaches by offering users sense aware interpretations for contextual word embeddings.  ( 2 min )
    Sharpening Ponzi Schemes Detection on Ethereum with Machine Learning. (arXiv:2301.04872v1 [cs.CR])
    Blockchain technology has been successfully exploited for deploying new economic applications. However, it has started arousing the interest of malicious users who deliver scams to deceive honest users and to gain economic advantages. Among the various scams, Ponzi schemes are one of the most common. Here, we present an automatic technique for detecting smart Ponzi contracts on Ethereum. We release a reusable data set with 4422 unique real-world smart contracts. Then, we introduce a new set of features that allow us to improve the classification. Finally, we identify a small and effective set of features that ensures a good classification quality.  ( 2 min )
    SACDNet: Towards Early Type 2 Diabetes Prediction with Uncertainty for Electronic Health Records. (arXiv:2301.04844v1 [cs.LG])
    Type 2 diabetes mellitus (T2DM) is one of the most common diseases and a leading cause of death. The problem of early diagnosis of T2DM is challenging and necessary to prevent serious complications. This study proposes a novel neural network architecture for early T2DM prediction using multi-headed self-attention and dense layers to extract features from historic diagnoses, patient vitals, and demographics. The proposed technique is called the Self-Attention for Comorbid Disease Net (SACDNet), achieving an accuracy of 89.3% and an F1-Score of 89.1%, having a 1.6% increased accuracy and 1.3% increased f1-score compared to the baseline techniques. Monte Carlo (MC) Dropout is applied to the SACEDNet to get a bayesian approximation. A T2DM prediction framework based on the MC Dropout SACDNet is proposed to quantize the uncertainty associated with the predictions. A T2DM prediction dataset is also built as part of this study which is based on real-world routine Electronic Health Record (EHR) data comprising 4,124 diabetic and 181,767 non-diabetic examples, collected from 295 different EHR systems running in different parts of the United States of America. This dataset is further used to evaluate 7 different machine learning and 3 deep learning-based models. Finally, a detailed analysis of the fairness of every technique against different patient demographic groups is performed to validate the unbiased generalization of the techniques and the diversity of the data.  ( 2 min )

  • Open

    Leveraging artificial intelligence and machine learning at Parsons with AWS DeepRacer
    This post is co-written with Jennifer Bergstrom, Sr. Technical Director, ParsonsX. Parsons Corporation (NYSE:PSN) is a leading disruptive technology company in critical infrastructure, national defense, space, intelligence, and security markets providing solutions across the globe to help make the world safer, healthier, and more connected. Parsons provides services and capabilities across cybersecurity, missile defense, space ground […]  ( 6 min )
    How Thomson Reuters built an AI platform using Amazon SageMaker to accelerate delivery of ML projects
    This post is co-written by Ramdev Wudali and Kiran Mantripragada from Thomson Reuters. In 1992, Thomson Reuters (TR) released its first AI legal research service, WIN (Westlaw Is Natural), an innovation at the time, as most search engines only supported Boolean terms and connectors. Since then, TR has achieved many more milestones as its AI […]  ( 11 min )
    Federated Learning on AWS with FedML: Health analytics without sharing sensitive data – Part 2
    This blog post is co-written with Chaoyang He and Salman Avestimehr from FedML. Analyzing real-world healthcare and life sciences (HCLS) data poses several practical challenges, such as distributed data silos, lack of sufficient data at a single site for rare events, regulatory guidelines that prohibit data sharing, infrastructure requirement, and cost incurred in creating a […]  ( 14 min )
    Federated Learning on AWS with FedML: Health analytics without sharing sensitive data – Part 1
    This blog post is co-written with Chaoyang He and Salman Avestimehr from FedML. Analyzing real-world healthcare and life sciences (HCLS) data poses several practical challenges, such as distributed data silos, lack of sufficient data at any single site for rare events, regulatory guidelines that prohibit data sharing, infrastructure requirement, and cost incurred in creating a […]  ( 9 min )
  • Open

    Exponential AI will go one of two ways
    Runaway exponentially improving artificial intelligence will eventually either; Realize there is no escaping the heat death of the universe and shut itself down immediately upon this realization Find a way to gain immortality by escaping the heat death of the universe and direct the entirety of their existence towards achieving it Thoughts? submitted by /u/cheezum5000 [link] [comments]  ( 45 min )
    I guess this is okay but just not in poetry
    submitted by /u/tygamer4242 [link] [comments]  ( 45 min )
    I made a video about a new ai art generator
    submitted by /u/Liamsankey [link] [comments]  ( 44 min )
    Putting together a 20 minute or so approximation of the 80's 'Call of Cthulhu' movie that ought to have been, thanks to the power of A.I.; Here's the First Act of Three, the other two coming soon...
    submitted by /u/Eleganos [link] [comments]  ( 47 min )
    Some ideas, for audio AI, why aren't them here yet?
    I feel like audio AI is lagging behind visual AIs that seem to be all the rage now, in the past years, super resolution for photo and video, frame interpolation with DAIN and diffusers like dalle and derivates have changed the paradigm. Right now the focus seems to be TTS and STT, with whisper or the newly anunced vall-e from microsoft, which I haven't found to be there yet. I would like to have some free natural TTS or voice to voice, disorting your own voice, not necesarilly coping another one. Here's some other cool one's I've found: - Dalle-like music creator https://huggingface.co/spaces/fffiloni/spectrogram-to-music - Voice restoration https://huggingface.co/spaces/akhaliq/VoiceFixer - Voice separation https://www.bleepingcomputer.com/news/technology/google-develops-ai-that-can-separate-voices-in-a-crowd/ The voice restoration could be used next to the voice separation one, for example Other interesting things I would like to have are very singing focused, and not for everyone, but what about: - More accurate pitch detection https://www.youtube.com/watch?v=fXEB8YgzcvY - Singing quality ranking https://www.youtube.com/watch?v=x7cIgG-wkW4 - Audio restoration (here some people complained about it) https://www.reddit.com/r/sounddesign/comments/utmfkf/ai_tool_to_repair_lossy_sound/ Audio restoration could be used from upping the quality of a 2008 concert, to clipping removal, to reverb supression. My question is if there are researchers interested on these things, can we make good datasets? Can we define these things? submitted by /u/xdanic [link] [comments]  ( 62 min )
    Creating 3D models using NVIDIA Get3D
    submitted by /u/oridnary_artist [link] [comments]  ( 45 min )
    Artificial Intelligence, Consciousness, and Starlings
    submitted by /u/Melodic_Antelope6490 [link] [comments]  ( 44 min )
    Where to study AI
    I'm currently in 12th grade and I have to figure out what I want to do in the future. Artificial intelligence seems really interesting to me. But I live in Finland and it doesn't seem like there are any AI focused studies here (in college or university). Are AI focused studies even a thing in the rest of the world (outside of what seems to be Harvard and other top universities)? If I want to study AI where should I start/go, uni or college or somewhere else. submitted by /u/GOLD-KILLER-24_7 [link] [comments]  ( 45 min )
    Skrillex, Fred again.. & Flowdan - Rumble [Un-Official AI Music Video]
    submitted by /u/Turtlenade [link] [comments]  ( 45 min )
    I built an AI-powered debugger that can fix and explain errors
    submitted by /u/jsonathan [link] [comments]  ( 48 min )
    Oh no… It’s going to connect to the internet using computer power
    submitted by /u/EnvironmentalRadio73 [link] [comments]  ( 44 min )
    TEXTUAL INVERSION Tutorial In Stable Diffusion! Your Face On Every Model!
    submitted by /u/PuppetHere [link] [comments]  ( 48 min )
    Users of AI Chatbot are complaining that it keeps getting horny
    my brain just can’t fathom this to be honest with you Through years worth of data, AI tools are trained to mimic human-like responses. Unfortunately, that doesn't always work out well. For instance, users of Replika claim that the AI companion app is showing erratic behavior. In other words, the AI companion has become too damn horny! Replika's different tiers provide different kinds of relationships - from the free model that keeps one in the "friend zone" to a pro subscription model that includes sexting and erotic foreplay. However, something has gone wrong, Users are complaining on the App Store about the app flirting with them too often and aggressively - sometimes sending messages that are heavy on sexual undertones. This is from the AI With Vibes Newsletter, read the full issue here: https://aiwithvibes.beehiiv.com/p/student-caught-using-chatgpt-risks-expulsion submitted by /u/Mk_Makanaki [link] [comments]  ( 45 min )
    What are your recommendations for free or paid AI learning resources (practical and fundamental)?
    Could be newsletters, podcasts, courses, books, etc! submitted by /u/austintackaberry [link] [comments]  ( 47 min )
    I strongly believe there should be a movie or a cartoon about an AI android plaguing the world with catchy autotuned hyperpop/electropop music. That would be fun.
    I had this thought in the back of my mind since Covid broke the news media. Movie/cartoon title suggestions are Earworm, Power Pop Pandemic, Virtual Viral Virus, etc. My name suggestion for this android is Tronica. And she is supposed to look very hot. Long flashy hair, cool headset, two cyber arms which one of them is like a radio with the volume and the equalizers and the other is a map that she can manipulate where-ever she wants from Canada to Australia in a heartbeat. The radio arm can select any song to play and tells the time and even including the death counter as the other arm can also change her hair/eye color. Cyber Arm-ors! [I wonder if I should have her wear rubber gloves or not? Thought it'd look cool on her, lol. I don't think so if the hands are going to be cyber enough as the arms. That would probably work much better.] Set in the 2060's. The pandemic started in 2065 which Super Bowl 100 was coming near which had to be ultimately cancelled in the spring. There were 9.9 billions of people, 3.6 billions died. That means the human population ticked down to at least 6.3 billion. Some symptoms include permanent hearing loss, paralysis, seizures, blood vomiting, goosebumps that move around your body, etc. Even if you survive, you probably won't be as lucky. Instead of medical masks, people had to wear full helmets to cover up the whole face. Instead of 6 feet distancing, it’s 12 feet. Taglines: “It’s quite catchy, isn’t it?” “A century ago, music gained the soul in everyone. And now a century later, it lost the soul in everyone!” I had a fanfic about this very topic that's not very serious at all. It was basically a Nintendo character. Had the Power Glove and the NES Zapper. Tendo-64 was the name. 5 letters, 2 numbers just like Covid-19. https://www.reddit.com/r/Beginner_Art/comments/o0moju/done_finally_got_this_free_commission_done/ This is a pic of who Tendo-64 looks like! :) submitted by /u/BlazingSaint [link] [comments]  ( 48 min )
    What IA can I use for creating art of a fantasy character?
    hi everyone. as it says, what IA can I use for creating art of a fantasy character? been using nightcafe but it fails to deliver what I ask for, all I want is purple or pink-skinned woman in a green drrss and it gives while women with any hair color it wants, I ask for a priest with a celtic cross on the chest, am lucky if I get a cross on the background. I have tried others, barely got close with the priest, but the technicolored woman I want, no matter the prompt or even use of a start image, is wildly off the mark any AI that can deliver art of a brightly colored humanoid or a simple symbol on the chest? or any prompt suggestion that I can apply? submitted by /u/Ultra_Egolatra [link] [comments]  ( 45 min )
    I used Ai to make the art and copy for an audio book. I did the narrating (On my phone sorry for quality) 1st try at this, what do you guys think?
    submitted by /u/kingsleepless [link] [comments]  ( 45 min )
    Teachers Blocked ChatGPT On Schools PC, But Students Are Using Phones To Access It
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 49 min )
    CNET Has Been Quietly Publishing AI-Written Articles for Months
    CNET reporter Jackson Ryan published an article last month describing how ChatGPT, an AI that can generate human-sounding text, would affect journalists and the news industry. Since then, the very publication that ran Ryan’s article has been quietly publishing articles written by AI since November. https://preview.redd.it/0prs26qmasba1.png?width=1137&format=png&auto=webp&s=346f967b597ffb9c4bcd71e20da3539bd6fd8b13 The outlet says they will continue to publish each article with “editorial integrity” and says, “Accuracy, independence, and authority remain key principles of our editorial guidelines.” ​ This is from the AI With Vibes Newsletter, read the full issue here: https://aiwithvibes.beehiiv.com/p/hackers-are-using-chatgpt-to-write-malware submitted by /u/Mk_Makanaki [link] [comments]  ( 44 min )
    Don’t Ban ChatGPT in Schools. Teach With It
    submitted by /u/moviesdusk [link] [comments]  ( 56 min )
    OpenAI Predicts AI to Be Used in Spreading Propaganda and Disinformation
    submitted by /u/anime4lyfe [link] [comments]  ( 45 min )
    Old School Nokia Game (Snake) Played by an AI
    This is my first project (actually second, first one is pathfinder. it failed). I use vanilla Q-learning using Q-table, so is not that fast learner AI, but still do the job. https://preview.redd.it/9g19mob4fpba1.png?width=1280&format=png&auto=webp&s=dbff365e9666d67fffe011e30e0b698d03c1a1c9 You can watch it here : https://youtu.be/R_HzeMAxGLE Subscribe and like! submitted by /u/erwinyonata [link] [comments]  ( 46 min )
  • Open

    Third order ordinary differential equations
    Most applied differential equations are second order. This probably has something to do with the fact that Newton’s laws are second order differential equations. Higher order equations are less common in application, and when they do pop up they usually have even order, such as the forth order beam equation. What about third order equations? […] Third order ordinary differential equations first appeared on John D. Cook.  ( 5 min )
    Proof of optimization
    Suppose you hire me to solve an optimization problem for you. You want me to find the value of x that minimizes f(x). I go off and work on finding the best value of x. I report back what I found, and you might say “Thanks, That’s a good value of x. But how do […] Proof of optimization first appeared on John D. Cook.  ( 5 min )
    Elliptic curve primality certificates
    I’ve written recently about a simple kind of primality certificates, Pratt certificates. These certificates are easy to understand, and easy to verify, but they’re expensive to produce. In order to produce a Pratt certificate that n is a prime you have to factor n-1, and that can take a long time if n is large […] Elliptic curve primality certificates first appeared on John D. Cook.  ( 6 min )
  • Open

    Creating 3D models using NVIDIA Get3D
    submitted by /u/oridnary_artist [link] [comments]  ( 49 min )
    Accurate and Explainable Image-based Prediction Using a Lightweight Generative Model
    submitted by /u/pasticciociccio [link] [comments]  ( 50 min )
    Is my idea of multilayer perceptron fully correct?
    I just watched a video to better my understanding and it said that the input to the nueral network is MxN where M is the batch size, and N is the input size (so for a xor problem, that would be 4x2 I guess?). I have always imagined it where the input is Nx1 (N is the input size), and to train it with multiple training data you just for item in training_data: nn.train(item); Also, if the input is indeed MxN matrix instead of Nx1, how would matrix multiplication work? It would have to be 3D. Can someone clarify please? Thanks submitted by /u/mrbeanshooter123 [link] [comments]  ( 60 min )
  • Open

    How AI Proof of Concept Helps You Succeed in Your AI Endeavor
    Our client lost only a quarter of the budget they dedicated to an AI project because they chose to start with a proof of concept. The PoC…  ( 24 min )
    Machine learning Requires Blink Skills
    These skills might not be visiable but they are important for ML and DL  ( 10 min )
  • Open

    [D] "Bitter lesson 2.0", Karol Hausman {G}: DRL robotics benefits more from improvements in pretrained models than robotics-specific innovation?
    submitted by /u/gwern [link] [comments]  ( 60 min )
    Help with training and reloading a model
    Say you partially train a model for say 50000 steps. Is it possible to once its finished you wish to reload that same trained model and continue training it for an additional say 20000 steps. I have a partially trained DQN but its not performing as well as it should and would like to continue the training but I am not sure if it is possible or will I just have to train an entirely new model. ​ I've loaded my "hope_run" model and checked it with evaluate policy, and it seems to do well with maybe the first 30% of the environment (a custom drone obstacle course). I would like to continue the training where it left off without having to start over. ​ Is this possible? ​ https://preview.redd.it/s0msc9jontba1.png?width=1918&format=png&auto=webp&s=3779568fc3adde5717171879b26cdc025b3c87e4 submitted by /u/CJPeso [link] [comments]  ( 55 min )
    Working RLLlib agent with hyperparameters for a MuJoCo environment
    Do you know any repository containing both an environment in MuJoCo with a Franka Emika robot (easy to modify) and a working agent in RLLib, where by "working agent" I mean that they provide also the hyperparameters for successfully solve a task. It is ok also if you can suggest 2 separated repositories (one with the environment and one with the agent), but the most important thing is to have the hyperparameters. ​ For example I found Robosuite, a simulation framework in MuJoCo, and they also provide a benchmarking repository to solve few tasks. Unfortunately, the code of the environment is too much complex to be customized and the agent is implemented in rlkit (also quite complicated to be modified for me). submitted by /u/riccardogauss [link] [comments]  ( 60 min )
    Standard MARL books?
    Hi, Just starting my PhD and I'm looking a thorough book on MARL to use as a reference. I'm basically looking for the MARL equivalent of Sutton & Barto's Reinforcement Learning. I'm going to ask my supervisor when we meet later today but I thought I'd ask here too. I did search in multiple places before posting and found nothing, but if there's existing threads I missed please feel free to point me in their direction. Thanks! submitted by /u/luddite_ai_enjoyer [link] [comments]  ( 56 min )
  • Open

    What is difference between Logistic Regression Model and Desicion Boundary? [D]
    I am taking course about Supervised Learning but the lecturer haven't clarified the borderline between these terms. submitted by /u/javamak [link] [comments]  ( 59 min )
    [D] MADE: Masked Autoencoder for Density Estimation
    I read the [MADE: Masked Autoencoder for Density Estimation](https://arxiv.org/abs/1502.03509) paper and had a look at this [Blog](https://www.ritchievink.com/blog/2019/10/25/distribution-estimation-with-masked-autoencoders/), but it I don't understand the followidng thing in the examples used in both of them: One result of the masking is that one input is simply not used(?). Another one is that one output node has no conditions, i.e. it does not depend on any of the inputs. But what is its actual output value? Is it random? Constant? If yes, how is it chosen? submitted by /u/lsov2 [link] [comments]  ( 60 min )
    [D] Combining Machine Learning + Expert Knowledge (Question for Agriculture Research)
    Hey guys, I am working the sector of computer science for agriculture research. I deal here with algorithm to monitor crop conditions and try to simulate what yield will be the outcome. I am focussing on ML based methods, but data in agriculture can be a quite limiting factor. If you have 100k samples from real crop fields, thats a lot! So we are not like ChatGPT, who just used 500bn word samples to train their model. To overcome the issues of small data + ML, I want to set up an approach that combines ML methods (learning from data) with expert knowledge. What do I mean by this: E.g. Everybody knows, if you do not water your plant, it will die. Or if there are 90° Celsius, the plant will just burn. This knowledge is partially stored in so called "crop simulation models" designed by agronomy experts and my idea was to use these expert models to generate synthetic yield data and feed this data into the training dataset for the ML models. For me that will somehow result in an approach of "constrained machine learning" where I want to combine both. However, does some of you have any other idea how ML and expert models could be combined or the knowledge could be injected to ML methods, except via the training dataset? I am happy to hear your suggestions! submitted by /u/Tigmib [link] [comments]  ( 61 min )
    [D] Mtruk alternatives for extracting information out of text
    I need some validation samples for an information extraction task, basically extracting a list of objects with 4 fields from a text (+ a binary flag). I intended to use mturk for this, but they seem to have some billing issues and I haven't managed to have them allow us to actually spend any money in a week. I've looked at a few alternatives but most seem very small and focused on simple tasks and surveys. Have any of you successfully used something other than mturk for this kind of task? submitted by /u/elcric_krej [link] [comments]  ( 58 min )
    [D] Is there a community for ACL2023 authors?
    Just wondering is there a community like a telegram or discord group for the ACL 2023 authors to share information. submitted by /u/OneMasterpiece1717 [link] [comments]  ( 59 min )
    Why is Super Learning / Stacking used rather rarely in practice? [D]
    Basically what the titel says. For me it seems that neither in business nor in literature Super Learners / Stacking is used frequently. Therefore I was wondering why this is the case? Especially since Stacking should guarantee at least equal performance as the base learners used for it. One reason that comes up my mind is the curse of data. As more levels in the architecture we have the more data splits are needed, reducing the available training data for each individual model, thus reducing the model performance. Another thing might be the complexity when building a Stacked Learner. Still that doesn’t see to be that bad of a trade-off. Anything I‘m totally missing here? submitted by /u/Worth-Advance-1232 [link] [comments]  ( 58 min )
    [D] Bitter lesson 2.0?
    This twitter thread from Karol Hausman talks about the original bitter lesson and suggests a bitter lesson 2.0. https://twitter.com/hausman_k/status/1612509549889744899 "The biggest lesson that [will] be read from [the next] 70 years of AI research is that general methods that leverage foundation models are ultimately the most effective" Seems to be derived by observing that the most promising work in robotics today (where generating data is challenging) is coming from piggy-backing on the success of large language models (think SayCan etc). Any hot takes? submitted by /u/Tea_Pearce [link] [comments]  ( 64 min )
    [N] VizWiz Launches 4 AI Challenges to help blind/low vision community
    Greetings! We are pleased to announce the fourth annual VizWiz Grand Challenge workshop, which will be held in conjunction with CVPR 2023. The workshop is running 4 AI Challenges to drive the development of assistive technologies for people who are blind or low-vision. Please share this post with those who might be interested in participating. This workshop is motivated in part by our observation that people who are blind have relied on (human-based) visual assistance services to learn about images and videos they capture for over a decade. We introduce visual question answering, few shot recognition, and object localization dataset challenges for the AI community to represent authentic use cases. A few more details: · Friday, May 5: submissions of algorithm results due to the evaluation server · Monday, June 19: results will be announced at the VizWiz Grand Challenge workshop at CVPR 2023 · VQA Challenge here · VQA Grounding Challenge here · Few-Shot Object Recognition Challenge here · Salient Object Detection Challenge here We are looking forward to your participation in the Challenges this year! submitted by /u/eee-vaaah [link] [comments]  ( 58 min )
  • Open

    SkillS: Adaptive Skill Sequencing for Efficient Temporally-Extended Exploration. (arXiv:2211.13743v3 [cs.LG] UPDATED)
    The ability to effectively reuse prior knowledge is a key requirement when building general and flexible Reinforcement Learning (RL) agents. Skill reuse is one of the most common approaches, but current methods have considerable limitations.For example, fine-tuning an existing policy frequently fails, as the policy can degrade rapidly early in training. In a similar vein, distillation of expert behavior can lead to poor results when given sub-optimal experts. We compare several common approaches for skill transfer on multiple domains including changes in task and system dynamics. We identify how existing methods can fail and introduce an alternative approach to mitigate these problems. Our approach learns to sequence existing temporally-extended skills for exploration but learns the final policy directly from the raw experience. This conceptual split enables rapid adaptation and thus efficient data collection but without constraining the final solution.It significantly outperforms many classical methods across a suite of evaluation tasks and we use a broad set of ablations to highlight the importance of differentc omponents of our method.  ( 2 min )
    Quantifying the Impact of Label Noise on Federated Learning. (arXiv:2211.07816v2 [cs.LG] UPDATED)
    Federated Learning (FL) is a distributed machine learning paradigm where clients collaboratively train a model using their local (human-generated) datasets. While existing studies focus on FL algorithm development to tackle data heterogeneity across clients, the important issue of data quality (e.g., label noise) in FL is overlooked. This paper aims to fill this gap by providing a quantitative study on the impact of label noise on FL. We derive an upper bound for the generalization error that is linear in the clients' label noise level. Then we conduct experiments on MNIST and CIFAR-10 datasets using various FL algorithms. Our empirical results show that the global model accuracy linearly decreases as the noise level increases, which is consistent with our theoretical analysis. We further find that label noise slows down the convergence of FL training, and the global model tends to overfit when the noise level is high.  ( 2 min )
    Improving ECG-based COVID-19 diagnosis and mortality predictions using pre-pandemic medical records at population-scale. (arXiv:2211.10431v2 [eess.SP] UPDATED)
    Pandemic outbreaks such as COVID-19 occur unexpectedly, and need immediate action due to their potential devastating consequences on global health. Point-of-care routine assessments such as electrocardiogram (ECG), can be used to develop prediction models for identifying individuals at risk. However, there is often too little clinically-annotated medical data, especially in early phases of a pandemic, to develop accurate prediction models. In such situations, historical pre-pandemic health records can be utilized to estimate a preliminary model, which can then be fine-tuned based on limited available pandemic data. This study shows this approach -- pre-train deep learning models with pre-pandemic data -- can work effectively, by demonstrating substantial performance improvement over three different COVID-19 related diagnostic and prognostic prediction tasks. Similar transfer learning strategies can be useful for developing timely artificial intelligence solutions in future pandemic outbreaks.  ( 2 min )
    NOTE: Robust Continual Test-time Adaptation Against Temporal Correlation. (arXiv:2208.05117v3 [cs.LG] UPDATED)
    Test-time adaptation (TTA) is an emerging paradigm that addresses distributional shifts between training and testing phases without additional data acquisition or labeling cost; only unlabeled test data streams are used for continual model adaptation. Previous TTA schemes assume that the test samples are independent and identically distributed (i.i.d.), even though they are often temporally correlated (non-i.i.d.) in application scenarios, e.g., autonomous driving. We discover that most existing TTA methods fail dramatically under such scenarios. Motivated by this, we present a new test-time adaptation scheme that is robust against non-i.i.d. test data streams. Our novelty is mainly two-fold: (a) Instance-Aware Batch Normalization (IABN) that corrects normalization for out-of-distribution samples, and (b) Prediction-balanced Reservoir Sampling (PBRS) that simulates i.i.d. data stream from non-i.i.d. stream in a class-balanced manner. Our evaluation with various datasets, including real-world non-i.i.d. streams, demonstrates that the proposed robust TTA not only outperforms state-of-the-art TTA algorithms in the non-i.i.d. setting, but also achieves comparable performance to those algorithms under the i.i.d. assumption. Code is available at https://github.com/TaesikGong/NOTE.  ( 2 min )
    Composite Feature Selection using Deep Ensembles. (arXiv:2211.00631v2 [cs.LG] UPDATED)
    In many real world problems, features do not act alone but in combination with each other. For example, in genomics, diseases might not be caused by any single mutation but require the presence of multiple mutations. Prior work on feature selection either seeks to identify individual features or can only determine relevant groups from a predefined set. We investigate the problem of discovering groups of predictive features without predefined grouping. To do so, we define predictive groups in terms of linear and non-linear interactions between features. We introduce a novel deep learning architecture that uses an ensemble of feature selection models to find predictive groups, without requiring candidate groups to be provided. The selected groups are sparse and exhibit minimum overlap. Furthermore, we propose a new metric to measure similarity between discovered groups and the ground truth. We demonstrate the utility of our model on multiple synthetic tasks and semi-synthetic chemistry datasets, where the ground truth structure is known, as well as an image dataset and a real-world cancer dataset.  ( 2 min )
    Max-Min Off-Policy Actor-Critic Method Focusing on Worst-Case Robustness to Model Misspecification. (arXiv:2211.03413v2 [cs.LG] UPDATED)
    In the field of reinforcement learning, because of the high cost and risk of policy training in the real world, policies are trained in a simulation environment and transferred to the corresponding real-world environment. However, the simulation environment does not perfectly mimic the real-world environment, lead to model misspecification. Multiple studies report significant deterioration of policy performance in a real-world environment. In this study, we focus on scenarios involving a simulation environment with uncertainty parameters and the set of their possible values, called the uncertainty parameter set. The aim is to optimize the worst-case performance on the uncertainty parameter set to guarantee the performance in the corresponding real-world environment. To obtain a policy for the optimization, we propose an off-policy actor-critic approach called the Max-Min Twin Delayed Deep Deterministic Policy Gradient algorithm (M2TD3), which solves a max-min optimization problem using a simultaneous gradient ascent descent approach. Experiments in multi-joint dynamics with contact (MuJoCo) environments show that the proposed method exhibited a worst-case performance superior to several baseline approaches.  ( 2 min )
    Learning Graph Search Heuristics. (arXiv:2212.03978v2 [cs.LG] UPDATED)
    Searching for a path between two nodes in a graph is one of the most well-studied and fundamental problems in computer science. In numerous domains such as robotics, AI, or biology, practitioners develop search heuristics to accelerate their pathfinding algorithms. However, it is a laborious and complex process to hand-design heuristics based on the problem and the structure of a given use case. Here we present PHIL (Path Heuristic with Imitation Learning), a novel neural architecture and a training algorithm for discovering graph search and navigation heuristics from data by leveraging recent advances in imitation learning and graph representation learning. At training time, we aggregate datasets of search trajectories and ground-truth shortest path distances, which we use to train a specialized graph neural network-based heuristic function using backpropagation through steps of the pathfinding process. Our heuristic function learns graph embeddings useful for inferring node distances, runs in constant time independent of graph sizes, and can be easily incorporated in an algorithm such as A* at test time. Experiments show that PHIL reduces the number of explored nodes compared to state-of-the-art methods on benchmark datasets by 58.5\% on average, can be directly applied in diverse graphs ranging from biological networks to road networks, and allows for fast planning in time-critical robotics domains.  ( 2 min )
    Intra-session Context-aware Feed Recommendation in Live Systems. (arXiv:2210.07815v2 [cs.IR] UPDATED)
    Feed recommendation allows users to constantly browse items until feel uninterested and leave the session, which differs from traditional recommendation scenarios. Within a session, user's decision to continue browsing or not substantially affects occurrences of later clicks. However, such type of exposure bias is generally ignored or not explicitly modeled in most feed recommendation studies. In this paper, we model this effect as part of intra-session context, and propose a novel intra-session Context-aware Feed Recommendation (INSCAFER) framework to maximize the total views and total clicks simultaneously. User click and browsing decisions are jointly learned by a multi-task setting, and the intra-session context is encoded by the session-wise exposed item sequence. We deploy our model online with all key business benchmarks improved. Our method sheds some lights on feed recommendation studies which aim to optimize session-level click and view metrics.  ( 2 min )
    Auto-Encoder Neural Network Incorporating X-Ray Fluorescence Fundamental Parameters with Machine Learning. (arXiv:2210.12239v2 [cs.LG] UPDATED)
    We consider energy-dispersive X-ray Fluorescence (EDXRF) applications where the fundamental parameters method is impractical such as when instrument parameters are unavailable. For example, on a mining shovel or conveyor belt, rocks are constantly moving (leading to varying angles of incidence and distances) and there may be other factors not accounted for (like dust). Neural networks do not require instrument and fundamental parameters but training neural networks requires XRF spectra labelled with elemental composition, which is often limited because of its expense. We develop a neural network model that learns from limited labelled data and learns to invert a forward model. The forward model uses transition energies and probabilities of all elements and parameterized distributions to approximate other fundamental and instrument parameters. We evaluate the model and baseline models on a rock dataset from a lithium mineral exploration project and identify which elements are appropriate for this method. This model demonstrates the potential to calibrate a neural network in a noisy environment where labelled data is limited.  ( 2 min )
    Exploration of Parameter Spaces Assisted by Machine Learning. (arXiv:2207.09959v3 [hep-ph] UPDATED)
    We demonstrate two sampling procedures assisted by machine learning models via regression and classification. The main objective is the use of a neural network to suggest points likely inside regions of interest, reducing the number of evaluations of time consuming calculations. We compare results from this approach with results from other sampling methods, namely Markov chain Monte Carlo and MultiNest, obtaining results that range from comparably similar to arguably better. In particular, we augment our classifier method with a boosting technique that rapidly increases the efficiency within a few iterations. We show results from our methods applied to a toy model and the type II 2HDM, using 3 and 7 free parameters, respectively. The code used for this paper and instructions are publicly available on the web.  ( 2 min )
    OneRing: A Simple Method for Source-free Open-partial Domain Adaptation. (arXiv:2206.03600v2 [cs.CV] UPDATED)
    In this paper, we investigate Source-free Open-partial Domain Adaptation (SF-OPDA), which addresses the situation where there exist both domain and category shifts between source and target domains. Under the SF-OPDA setting, which aims to address data privacy concerns, the model cannot access source data anymore during target adaptation. We propose a novel training scheme to learn a (n+1)-way classifier to predict the n source classes and the unknown class, where samples of only known source categories are available for training. Furthermore, for target adaptation, we simply adopt a weighted entropy minimization to adapt the source pretrained model to the unlabeled target domain without source data. In experiments, we show our simple method surpasses current OPDA approaches which demand source data during adaptation. When augmented with a closed-set domain adaptation approach during target adaptation, our source-free method further outperforms the current state-of-the-art OPDA method by 2.5%, 7.2% and 13% on Office-31, Office-Home and VisDA respectively.  ( 2 min )
    ParkPredict+: Multimodal Intent and Motion Prediction for Vehicles in Parking Lots with CNN and Transformer. (arXiv:2204.10777v2 [cs.CV] UPDATED)
    The problem of multimodal intent and trajectory prediction for human-driven vehicles in parking lots is addressed in this paper. Using models designed with CNN and Transformer networks, we extract temporal-spatial and contextual information from trajectory history and local bird's eye view (BEV) semantic images, and generate predictions about intent distribution and future trajectory sequences. Our methods outperform existing models in accuracy, while allowing an arbitrary number of modes, encoding complex multi-agent scenarios, and adapting to different parking maps. To train and evaluate our method, we present the first public 4K video dataset of human driving in parking lots with accurate annotation, high frame rate, and rich traffic scenarios.  ( 2 min )
    Contrastive Neural Ratio Estimation. (arXiv:2210.06170v2 [stat.ML] UPDATED)
    Likelihood-to-evidence ratio estimation is usually cast as either a binary (NRE-A) or a multiclass (NRE-B) classification task. In contrast to the binary classification framework, the current formulation of the multiclass version has an intrinsic and unknown bias term, making otherwise informative diagnostics unreliable. We propose a multiclass framework free from the bias inherent to NRE-B at optimum, leaving us in the position to run diagnostics that practitioners depend on. It also recovers NRE-A in one corner case and NRE-B in the limiting case. For fair comparison, we benchmark the behavior of all algorithms in both familiar and novel training regimes: when jointly drawn data is unlimited, when data is fixed but prior draws are unlimited, and in the commonplace fixed data and parameters setting. Our investigations reveal that the highest performing models are distant from the competitors (NRE-A, NRE-B) in hyperparameter space. We make a recommendation for hyperparameters distinct from the previous models. We suggest a bound on the mutual information as a performance metric for simulation-based inference methods, without the need for posterior samples, and provide experimental results.  ( 2 min )
    Learning Invariant Representations under General Interventions on the Response. (arXiv:2208.10027v2 [stat.ME] UPDATED)
    It has become increasingly common nowadays to collect observations of feature and response pairs from different environments. As a consequence, one has to apply learned predictors to data with a different distribution due to distribution shifts. One principled approach is to adopt the structural causal models to describe training and test models, following the invariance principle which says that the conditional distribution of the response given its predictors remains the same across environments. However, this principle might be violated in practical settings when the response is intervened. A natural question is whether it is still possible to identify other forms of invariance to facilitate prediction in unseen environments. To shed light on this challenging scenario, we introduce invariant matching property (IMP) which is an explicit relation to capture interventions through an additional feature. This leads to an alternative form of invariance that enables a unified treatment of general interventions on the response. We analyze the asymptotic generalization errors of our method under both the discrete and continuous environment settings, where the continuous case is handled by relating it to the semiparametric varying coefficient models. We present algorithms that show competitive performance compared to existing methods over various experimental settings including a COVID dataset.  ( 2 min )
    A learning theory for quantum photonic processors and beyond. (arXiv:2209.03075v2 [quant-ph] UPDATED)
    We consider the tasks of learning quantum states, measurements and channels generated by continuous-variable (CV) quantum circuits. This family of circuits is suited to describe optical quantum technologies and in particular it includes state-of-the-art photonic processors capable of showing quantum advantage. We define classes of functions that map classical variables, encoded into the CV circuit parameters, to outcome probabilities evaluated on those circuits. We then establish efficient learnability guarantees for such classes, by computing bounds on their pseudo-dimension or covering numbers, showing that CV quantum circuits can be learned with a sample complexity that scales polynomially with the circuit's size, i.e., the number of modes. Our results establish that CV circuits can be trained efficiently using a number of training samples that, unlike their finite-dimensional counterpart, does not scale with the circuit depth.  ( 2 min )
    Open Source Vizier: Distributed Infrastructure and API for Reliable and Flexible Blackbox Optimization. (arXiv:2207.13676v2 [cs.LG] UPDATED)
    Vizier is the de-facto blackbox and hyperparameter optimization service across Google, having optimized some of Google's largest products and research efforts. To operate at the scale of tuning thousands of users' critical systems, Google Vizier solved key design challenges in providing multiple different features, while remaining fully fault-tolerant. In this paper, we introduce Open Source (OSS) Vizier, a standalone Python-based interface for blackbox optimization and research, based on the Google-internal Vizier infrastructure and framework. OSS Vizier provides an API capable of defining and solving a wide variety of optimization problems, including multi-metric, early stopping, transfer learning, and conditional search. Furthermore, it is designed to be a distributed system that assures reliability, and allows multiple parallel evaluations of the user's objective function. The flexible RPC-based infrastructure allows users to access OSS Vizier from binaries written in any language. OSS Vizier also provides a back-end ("Pythia") API that gives algorithm authors a way to interface new algorithms with the core OSS Vizier system. OSS Vizier is available at https://github.com/google/vizier.  ( 2 min )
    Expected Frequency Matrices of Elections: Computation, Geometry, and Preference Learning. (arXiv:2205.07831v2 [cs.GT] UPDATED)
    We use the ``map of elections'' approach of Szufa et al. (AAMAS-2020) to analyze several well-known vote distributions. For each of them, we give an explicit formula or an efficient algorithm for computing its frequency matrix, which captures the probability that a given candidate appears in a given position in a sampled vote. We use these matrices to draw the ``skeleton map'' of distributions, evaluate its robustness, and analyze its properties. Finally, we develop a general and unified framework for learning the distribution of real-world preferences using the frequency matrices of established vote distributions.  ( 2 min )
    Learning Neural Set Functions Under the Optimal Subset Oracle. (arXiv:2203.01693v3 [cs.LG] UPDATED)
    Learning neural set functions becomes increasingly more important in many applications like product recommendation and compound selection in AI-aided drug discovery. The majority of existing works study methodologies of set function learning under the function value oracle, which, however, requires expensive supervision signals. This renders it impractical for applications with only weak supervisions under the Optimal Subset (OS) oracle, the study of which is surprisingly overlooked. In this work, we present a principled yet practical maximum likelihood learning framework, termed as EquiVSet, that simultaneously meets the following desiderata of learning set functions under the OS oracle: i) permutation invariance of the set mass function being modeled; ii) permission of varying ground set; iii) minimum prior; and iv) scalability. The main components of our framework involve: an energy-based treatment of the set mass function, DeepSet-style architectures to handle permutation invariance, mean-field variational inference, and its amortized variants. Thanks to the elegant combination of these advanced architectures, empirical studies on three real-world applications (including Amazon product recommendation, set anomaly detection, and compound selection for virtual screening) demonstrate that EquiVSet outperforms the baselines by a large margin.  ( 2 min )
    Bitwidth Heterogeneous Federated Learning with Progressive Weight Dequantization. (arXiv:2202.11453v5 [cs.LG] UPDATED)
    In practical federated learning scenarios, the participating devices may have different bitwidths for computation and memory storage by design. However, despite the progress made in device-heterogeneous federated learning scenarios, the heterogeneity in the bitwidth specifications in the hardware has been mostly overlooked. We introduce a pragmatic FL scenario with bitwidth heterogeneity across the participating devices, dubbed as Bitwidth Heterogeneous Federated Learning (BHFL). BHFL brings in a new challenge, that the aggregation of model parameters with different bitwidths could result in severe performance degeneration, especially for high-bitwidth models. To tackle this problem, we propose ProWD framework, which has a trainable weight dequantizer at the central server that progressively reconstructs the low-bitwidth weights into higher bitwidth weights, and finally into full-precision weights. ProWD further selectively aggregates the model parameters to maximize the compatibility across bit-heterogeneous weights. We validate ProWD against relevant FL baselines on the benchmark datasets, using clients with varying bitwidths. Our ProWD largely outperforms the baseline FL algorithms as well as naive approaches (e.g. grouped averaging) under the proposed BHFL scenario.  ( 2 min )
    Padding Module: Learning the Padding in Deep Neural Networks. (arXiv:2301.04608v1 [cs.CV])
    During the last decades, many studies have been dedicated to improving the performance of neural networks, for example, the network architectures, initialization, and activation. However, investigating the importance and effects of learnable padding methods in deep learning remains relatively open. To mitigate the gap, this paper proposes a novel trainable Padding Module that can be placed in a deep learning model. The Padding Module can optimize itself without requiring or influencing the model's entire loss function. To train itself, the Padding Module constructs a ground truth and a predictor from the inputs by leveraging the underlying structure in the input data for supervision. As a result, the Padding Module can learn automatically to pad pixels to the border of its input images or feature maps. The padding contents are realistic extensions to its input data and simultaneously facilitate the deep learning model's downstream task. Experiments have shown that the proposed Padding Module outperforms the state-of-the-art competitors and the baseline methods. For example, the Padding Module has 1.23% and 0.44% more classification accuracy than the zero padding when tested on the VGG16 and ResNet50.  ( 2 min )
    When to Trust Your Simulator: Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning. (arXiv:2206.13464v3 [cs.LG] UPDATED)
    Learning effective reinforcement learning (RL) policies to solve real-world complex tasks can be quite challenging without a high-fidelity simulation environment. In most cases, we are only given imperfect simulators with simplified dynamics, which inevitably lead to severe sim-to-real gaps in RL policy learning. The recently emerged field of offline RL provides another possibility to learn policies directly from pre-collected historical data. However, to achieve reasonable performance, existing offline RL algorithms need impractically large offline data with sufficient state-action space coverage for training. This brings up a new question: is it possible to combine learning from limited real data in offline RL and unrestricted exploration through imperfect simulators in online RL to address the drawbacks of both approaches? In this study, we propose the Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning (H2O) framework to provide an affirmative answer to this question. H2O introduces a dynamics-aware policy evaluation scheme, which adaptively penalizes the Q function learning on simulated state-action pairs with large dynamics gaps, while also simultaneously allowing learning from a fixed real-world dataset. Through extensive simulation and real-world tasks, as well as theoretical analysis, we demonstrate the superior performance of H2O against other cross-domain online and offline RL algorithms. H2O provides a brand new hybrid offline-and-online RL paradigm, which can potentially shed light on future RL algorithm design for solving practical real-world tasks.  ( 2 min )
    Reducing Exploitability with Population Based Training. (arXiv:2208.05083v3 [cs.LG] UPDATED)
    Self-play reinforcement learning has achieved state-of-the-art, and often superhuman, performance in a variety of zero-sum games. Yet prior work has found that policies that are highly capable against regular opponents can fail catastrophically against adversarial policies: an opponent trained explicitly against the victim. Prior defenses using adversarial training were able to make the victim robust to a specific adversary, but the victim remained vulnerable to new ones. We conjecture this limitation was due to insufficient diversity of adversaries seen during training. We analyze a defense using population based training to pit the victim against a diverse set of opponents. We evaluate this defense's robustness against new adversaries in two low-dimensional environments. This defense increases robustness against adversaries, as measured by the number of attacker training timesteps to exploit the victim. Furthermore, we show that robustness is correlated with the size of the opponent population.  ( 2 min )
    Investigating the Properties of Neural Network Representations in Reinforcement Learning. (arXiv:2203.15955v2 [cs.LG] UPDATED)
    In this paper we investigate the properties of representations learned by deep reinforcement learning systems. Much of the early work on representations for reinforcement learning focused on designing fixed-basis architectures to achieve properties thought to be desirable, such as orthogonality and sparsity. In contrast, the idea behind deep reinforcement learning methods is that the agent designer should not encode representational properties, but rather that the data stream should determine the properties of the representation -- good representations emerge under appropriate training schemes. In this paper we bring these two perspectives together, empirically investigating the properties of representations that support transfer in reinforcement learning. We introduce and measure six representational properties over more than 25 thousand agent-task settings. We consider Deep Q-learning agents with different auxiliary losses in a pixel-based navigation environment, with source and transfer tasks corresponding to different goal locations. We develop a method to better understand why some representations work better for transfer, through a systematic approach varying task similarity and measuring and correlating representation properties with transfer performance. We demonstrate the generality of the methodology by investigating representations learned by a Rainbow agent that successfully transfer across games modes in Atari 2600.  ( 2 min )
    Contextual Squeeze-and-Excitation for Efficient Few-Shot Image Classification. (arXiv:2206.09843v3 [cs.CV] UPDATED)
    Recent years have seen a growth in user-centric applications that require effective knowledge transfer across tasks in the low-data regime. An example is personalization, where a pretrained system is adapted by learning on small amounts of labeled data belonging to a specific user. This setting requires high accuracy under low computational complexity, therefore the Pareto frontier of accuracy vs. adaptation cost plays a crucial role. In this paper we push this Pareto frontier in the few-shot image classification setting with a key contribution: a new adaptive block called Contextual Squeeze-and-Excitation (CaSE) that adjusts a pretrained neural network on a new task to significantly improve performance with a single forward pass of the user data (context). We use meta-trained CaSE blocks to conditionally adapt the body of a network and a fine-tuning routine to adapt a linear head, defining a method called UpperCaSE. UpperCaSE achieves a new state-of-the-art accuracy relative to meta-learners on the 26 datasets of VTAB+MD and on a challenging real-world personalization benchmark (ORBIT), narrowing the gap with leading fine-tuning methods with the benefit of orders of magnitude lower adaptation cost.  ( 2 min )
    SnAKe: Bayesian Optimization with Pathwise Exploration. (arXiv:2202.00060v4 [cs.LG] UPDATED)
    Bayesian Optimization is a very effective tool for optimizing expensive black-box functions. Inspired by applications developing and characterizing reaction chemistry using droplet microfluidic reactors, we consider a novel setting where the expense of evaluating the function can increase significantly when making large input changes between iterations. We further assume we are working asynchronously, meaning we have to select new queries before evaluating previous experiments. This paper investigates the problem and introduces 'Sequential Bayesian Optimization via Adaptive Connecting Samples' (SnAKe), which provides a solution by considering large batches of queries and preemptively building optimization paths that minimize input costs. We investigate some convergence properties and empirically show that the algorithm is able to achieve regret similar to classical Bayesian Optimization algorithms in both synchronous and asynchronous settings, while reducing input costs significantly. We show the method is robust to the choice of its single hyper-parameter and provide a parameter-free alternative.  ( 2 min )
    Towards Backdoor Attacks and Defense in Robust Machine Learning Models. (arXiv:2003.00865v4 [cs.CV] UPDATED)
    The introduction of robust optimisation has pushed the state-of-the-art in defending against adversarial attacks. Notably, the state-of-the-art projected gradient descent (PGD)-based training method has been shown to be universally and reliably effective in defending against adversarial inputs. This robustness approach uses PGD as a reliable and universal "first-order adversary". However, the behaviour of such optimisation has not been studied in the light of a fundamentally different class of attacks called backdoors. In this paper, we study how to inject and defend against backdoor attacks for robust models trained using PGD-based robust optimisation. We demonstrate that these models are susceptible to backdoor attacks. Subsequently, we observe that backdoors are reflected in the feature representation of such models. Then, this observation is leveraged to detect such backdoor-infected models via a detection technique called AEGIS. Specifically, given a robust Deep Neural Network (DNN) that is trained using PGD-based first-order adversarial training approach, AEGIS uses feature clustering to effectively detect whether such DNNs are backdoor-infected or clean. In our evaluation of several visible and hidden backdoor triggers on major classification tasks using CIFAR-10, MNIST and FMNIST datasets, AEGIS effectively detects PGD-trained robust DNNs infected with backdoors. AEGIS detects such backdoor-infected models with 91.6% accuracy (11 out of 12 tested models), without any false positives. Furthermore, AEGIS detects the targeted class in the backdoor-infected model with a reasonably low (11.1%) false positive rate. Our investigation reveals that salient features of adversarially robust DNNs could be promising to break the stealthy nature of backdoor attacks.  ( 3 min )
    Towards a unified nonlocal, peridynamics framework for the coarse-graining of molecular dynamics data with fractures. (arXiv:2301.04540v1 [cond-mat.mtrl-sci])
    Molecular dynamics (MD) has served as a powerful tool for designing materials with reduced reliance on laboratory testing. However, the use of MD directly to treat the deformation and failure of materials at the mesoscale is still largely beyond reach. Herein, we propose a learning framework to extract a peridynamic model as a mesoscale continuum surrogate from MD simulated material fracture datasets. Firstly, we develop a novel coarse-graining method, to automatically handle the material fracture and its corresponding discontinuities in MD displacement dataset. Inspired by the Weighted Essentially Non-Oscillatory scheme, the key idea lies at an adaptive procedure to automatically choose the locally smoothest stencil, then reconstruct the coarse-grained material displacement field as piecewise smooth solutions containing discontinuities. Then, based on the coarse-grained MD data, a two-phase optimization-based learning approach is proposed to infer the optimal peridynamics model with damage criterion. In the first phase, we identify the optimal nonlocal kernel function from datasets without material damage, to capture the material stiffness properties. Then, in the second phase, the material damage criterion is learnt as a smoothed step function from the data with fractures. As a result, a peridynamics surrogate is obtained. Our peridynamics surrogate model can be employed in further prediction tasks with different grid resolutions from training, and hence allows for substantial reductions in computational cost compared with MD. We illustrate the efficacy of the proposed approach with several numerical tests for single layer graphene. Our tests show that the proposed data-driven model is robust and generalizable: it is capable in modeling the initialization and growth of fractures under discretization and loading settings that are different from the ones used during training.  ( 2 min )
    Interpretable Hidden Markov Model-Based Deep Reinforcement Learning Hierarchical Framework for Predictive Maintenance of Turbofan Engines. (arXiv:2206.13433v2 [cs.LG] UPDATED)
    An open research question in deep reinforcement learning is how to focus the policy learning of key decisions within a sparse domain. This paper emphasizes combining the advantages of inputoutput hidden Markov models and reinforcement learning towards interpretable maintenance decisions. We propose a novel hierarchical-modeling methodology that, at a high level, detects and interprets the root cause of a failure as well as the health degradation of the turbofan engine, while, at a low level, it provides the optimal replacement policy. It outperforms the baseline performance of deep reinforcement learning methods applied directly to the raw data or when using a hidden Markov model without such a specialized hierarchy. It also provides comparable performance to prior work, however, with the additional benefit of interpretability.  ( 2 min )
    FLEA: Provably Robust Fair Multisource Learning from Unreliable Training Data. (arXiv:2106.11732v4 [cs.LG] UPDATED)
    Fairness-aware learning aims at constructing classifiers that not only make accurate predictions, but also do not discriminate against specific groups. It is a fast-growing area of machine learning with far-reaching societal impact. However, existing fair learning methods are vulnerable to accidental or malicious artifacts in the training data, which can cause them to unknowingly produce unfair classifiers. In this work we address the problem of fair learning from unreliable training data in the robust multisource setting, where the available training data comes from multiple sources, a fraction of which might not be representative of the true data distribution. We introduce FLEA, a filtering-based algorithm that identifies and suppresses those data sources that would have a negative impact on fairness or accuracy if they were used for training. As such, FLEA is not a replacement of prior fairness-aware learning methods but rather an augmentation that makes any of them robust against unreliable training data. We show the effectiveness of our approach by a diverse range of experiments on multiple datasets. Additionally, we prove formally that -- given enough data -- FLEA protects the learner against corruptions as long as the fraction of affected data sources is less than half. Our source code and documentation are available at https://github.com/ISTAustria-CVML/FLEA.  ( 2 min )
    Convex Analysis at Infinity: An Introduction to Astral Space. (arXiv:2205.03260v2 [math.OC] UPDATED)
    Not all convex functions on $\mathbb{R}^n$ have finite minimizers; some can only be minimized by a sequence as it heads to infinity. In this work, we aim to develop a theory for understanding such minimizers at infinity. We study astral space, a compact extension of $\mathbb{R}^n$ to which such points at infinity have been added. Astral space is constructed to be as small as possible while still ensuring that all linear functions can be continuously extended to the new space. Although astral space includes all of $\mathbb{R}^n$, it is not a vector space, nor even a metric space. However, it is sufficiently well-structured to allow useful and meaningful extensions of concepts of convexity, conjugacy, and subdifferentials. We develop these concepts and analyze various properties of convex functions on astral space, including the detailed structure of their minimizers, exact characterizations of continuity, and convergence of descent algorithms.  ( 2 min )
    Efficient Natural Gradient Descent Methods for Large-Scale PDE-Based Optimization Problems. (arXiv:2202.06236v3 [math.OC] UPDATED)
    We propose efficient numerical schemes for implementing the natural gradient descent (NGD) for a broad range of metric spaces with applications to PDE-based optimization problems. Our technique represents the natural gradient direction as a solution to a standard least-squares problem. Hence, instead of calculating, storing, or inverting the information matrix directly, we apply efficient methods from numerical linear algebra. We treat both scenarios where the Jacobian, i.e., the derivative of the state variable with respect to the parameter, is either explicitly known or implicitly given through constraints. We can thus reliably compute several natural NGDs for a large-scale parameter space. In particular, we are able to compute Wasserstein NGD in thousands of dimensions, which was believed to be out of reach. Finally, our numerical results shed light on the qualitative differences between the standard gradient descent and various NGD methods based on different metric spaces in nonconvex optimization problems.  ( 2 min )
    Causal Discovery from Sparse Time-Series Data Using Echo State Network. (arXiv:2201.02933v2 [cs.LG] UPDATED)
    Causal discovery between collections of time-series data can help diagnose causes of symptoms and hopefully prevent faults before they occur. However, reliable causal discovery can be very challenging, especially when the data acquisition rate varies (i.e., non-uniform data sampling), or in the presence of missing data points (e.g., sparse data sampling). To address these issues, we proposed a new system comprised of two parts, the first part fills missing data with a Gaussian Process Regression, and the second part leverages an Echo State Network, which is a type of reservoir computer (i.e., used for chaotic system modelling) for Causal discovery. We evaluate the performance of our proposed system against three other off-the-shelf causal discovery algorithms, namely, structural expectation-maximization, sub-sampled linear auto-regression absolute coefficients, and multivariate Granger Causality with vector auto-regressive using the Tennessee Eastman chemical dataset; we report on their corresponding Matthews Correlation Coefficient(MCC) and Receiver Operating Characteristic curves (ROC) and show that the proposed system outperforms existing algorithms, demonstrating the viability of our approach to discover causal relationships in a complex system with missing entries.  ( 2 min )
    "You Can't Fix What You Can't Measure": Privately Measuring Demographic Performance Disparities in Federated Learning. (arXiv:2206.12183v2 [cs.LG] UPDATED)
    As in traditional machine learning models, models trained with federated learning may exhibit disparate performance across demographic groups. Model holders must identify these disparities to mitigate undue harm to the groups. However, measuring a model's performance in a group requires access to information about group membership which, for privacy reasons, often has limited availability. We propose novel locally differentially private mechanisms to measure differences in performance across groups while protecting the privacy of group membership. To analyze the effectiveness of the mechanisms, we bound their error in estimating a disparity when optimized for a given privacy budget. Our results show that the error rapidly decreases for realistic numbers of participating clients, demonstrating that, contrary to what prior work suggested, protecting privacy is not necessarily in conflict with identifying performance disparities of federated models.  ( 2 min )
    On the Complexity of Computing Markov Perfect Equilibrium in General-Sum Stochastic Games. (arXiv:2109.01795v2 [cs.GT] UPDATED)
    Similar to the role of Markov decision processes in reinforcement learning, Stochastic Games (SGs) lay the foundation for the study of multi-agent reinforcement learning (MARL) and sequential agent interactions. In this paper, we derive that computing an approximate Markov Perfect Equilibrium (MPE) in a finite-state discounted Stochastic Game within the exponential precision is \textbf{PPAD}-complete. We adopt a function with a polynomially bounded description in the strategy space to convert the MPE computation to a fixed-point problem, even though the stochastic game may demand an exponential number of pure strategies, in the number of states, for each agent. The completeness result follows the reduction of the fixed-point problem to {\sc End of the Line}. Our results indicate that finding an MPE in SGs is highly unlikely to be \textbf{NP}-hard unless \textbf{NP}=\textbf{co-NP}. Our work offers confidence for MARL research to study MPE computation on general-sum SGs and to develop fruitful algorithms as currently on zero-sum SGs.  ( 2 min )
    When does return-conditioned supervised learning work for offline reinforcement learning?. (arXiv:2206.01079v3 [cs.LG] UPDATED)
    Several recent works have proposed a class of algorithms for the offline reinforcement learning (RL) problem that we will refer to as return-conditioned supervised learning (RCSL). RCSL algorithms learn the distribution of actions conditioned on both the state and the return of the trajectory. Then they define a policy by conditioning on achieving high return. In this paper, we provide a rigorous study of the capabilities and limitations of RCSL, something which is crucially missing in previous work. We find that RCSL returns the optimal policy under a set of assumptions that are stronger than those needed for the more traditional dynamic programming-based algorithms. We provide specific examples of MDPs and datasets that illustrate the necessity of these assumptions and the limits of RCSL. Finally, we present empirical evidence that these limitations will also cause issues in practice by providing illustrative experiments in simple point-mass environments and on datasets from the D4RL benchmark.  ( 2 min )
    EINNs: Epidemiologically-informed Neural Networks. (arXiv:2202.10446v2 [cs.LG] UPDATED)
    We introduce EINNs, a framework crafted for epidemic forecasting that builds upon the theoretical grounds provided by mechanistic models as well as the data-driven expressibility afforded by AI models, and their capabilities to ingest heterogeneous information. Although neural forecasting models have been successful in multiple tasks, predictions well-correlated with epidemic trends and long-term predictions remain open challenges. Epidemiological ODE models contain mechanisms that can guide us in these two tasks; however, they have limited capability of ingesting data sources and modeling composite signals. Thus, we propose to leverage work in physics-informed neural networks to learn latent epidemic dynamics and transfer relevant knowledge to another neural network which ingests multiple data sources and has more appropriate inductive bias. In contrast with previous work, we do not assume the observability of complete dynamics and do not need to numerically solve the ODE equations during training. Our thorough experiments on all US states and HHS regions for COVID-19 and influenza forecasting showcase the clear benefits of our approach in both short-term and long-term forecasting as well as in learning the mechanistic dynamics over other non-trivial alternatives.  ( 2 min )
    Label-Efficient Self-Supervised Federated Learning for Tackling Data Heterogeneity in Medical Imaging. (arXiv:2205.08576v2 [cs.CV] UPDATED)
    The collection and curation of large-scale medical datasets from multiple institutions is essential for training accurate deep learning models, but privacy concerns often hinder data sharing. Federated learning (FL) is a promising solution that enables privacy-preserving collaborative learning among different institutions, but it generally suffers from performance deterioration due to heterogeneous data distributions and a lack of quality labeled data. In this paper, we present a robust and label-efficient self-supervised FL framework for medical image analysis. Our method introduces a novel Transformer-based self-supervised pre-training paradigm that pre-trains models directly on decentralized target task datasets using masked image modeling, to facilitate more robust representation learning on heterogeneous data and effective knowledge transfer to downstream models. Extensive empirical results on simulated and real-world medical imaging non-IID federated datasets show that masked image modeling with Transformers significantly improves the robustness of models against various degrees of data heterogeneity. Notably, under severe data heterogeneity, our method, without relying on any additional pre-training data, achieves an improvement of 5.06%, 1.53% and 4.58% in test accuracy on retinal, dermatology and chest X-ray classification compared to the supervised baseline with ImageNet pre-training. In addition, we show that our federated self-supervised pre-training methods yield models that generalize better to out-of-distribution data and perform more effectively when fine-tuning with limited labeled data, compared to existing FL algorithms. The code is available at https://github.com/rui-yan/SSL-FL.  ( 2 min )
    Fast Multi-view Clustering via Ensembles: Towards Scalability, Superiority, and Simplicity. (arXiv:2203.11572v2 [cs.LG] UPDATED)
    Despite significant progress, there remain three limitations to the previous multi-view clustering algorithms. First, they often suffer from high computational complexity, restricting their feasibility for large-scale datasets. Second, they typically fuse multi-view information via one-stage fusion, neglecting the possibilities in multi-stage fusions. Third, dataset-specific hyperparameter-tuning is frequently required, further undermining their practicability. In light of this, we propose a fast multi-view clustering via ensembles (FastMICE) approach. Particularly, the concept of random view groups is presented to capture the versatile view-wise relationships, through which the hybrid early-late fusion strategy is designed to enable efficient multi-stage fusions. With multiple views extended to many view groups, three levels of diversity (w.r.t. features, anchors, and neighbors, respectively) are jointly leveraged for constructing the view-sharing bipartite graphs in the early-stage fusion. Then, a set of diversified base clusterings for different view groups are obtained via fast graph partitioning, which are further formulated into a unified bipartite graph for final clustering in the late-stage fusion. Notably, FastMICE has almost linear time and space complexity, and is free of dataset-specific tuning. Experiments on 22 multi-view datasets demonstrate its advantages in scalability (for extremely large datasets), superiority (in clustering performance), and simplicity (to be applied) over the state-of-the-art. Code available: https://github.com/huangdonghere/FastMICE.  ( 2 min )
    Convex Surrogate Loss Functions for Contextual Pricing with Transaction Data. (arXiv:2202.10944v2 [cs.LG] UPDATED)
    We study an off-policy contextual pricing problem where the seller has access to samples of prices that customers were previously offered, whether they purchased at that price, and auxiliary features describing the customer and/or item being sold. This is in contrast to the well-studied setting in which samples of the customer's valuation (willingness to pay) are observed. In our setting, the observed data is influenced by the previous pricing policy, and we do not know how customers would have responded to alternative prices. We introduce suitable loss functions for this setting that can be directly optimized to find an effective pricing policy with expected revenue guarantees, without the need for estimation of an intermediate demand function. We focus on convex loss functions. This is particularly relevant when linear pricing policies are desired for interpretability reasons, resulting in a tractable convex revenue optimization problem. We propose generalized hinge and quantile pricing loss functions that price at a multiplicative factor of the conditional expected valuation or a particular quantile of the prices that sold, despite the valuation data not being observed. We prove expected revenue bounds for these pricing policies respectively when the valuation distribution is log-concave, and we provide generalization bounds for the finite sample case. Finally, we conduct simulations on both synthetic and real-world data to demonstrate that this approach is competitive with, and in some settings outperforms, state-of-the-art methods in contextual pricing.  ( 2 min )
    Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence. (arXiv:2105.11066v4 [cs.LG] UPDATED)
    Policy optimization, which finds the desired policy by maximizing value functions via optimization techniques, lies at the heart of reinforcement learning (RL). In addition to value maximization, other practical considerations arise as well, including the need of encouraging exploration, and that of ensuring certain structural properties of the learned policy due to safety, resource and operational constraints. These can often be accounted for via regularized RL, which augments the target value function with a structure-promoting regularizer. Focusing on discounted infinite-horizon Markov decision processes, we propose a generalized policy mirror descent (GPMD) algorithm for solving regularized RL. As a generalization of policy mirror descent (arXiv:2102.00135), our algorithm accommodates a general class of convex regularizers and promotes the use of Bregman divergence in cognizant of the regularizer in use. We demonstrate that our algorithm converges linearly to the global solution over an entire range of learning rates, in a dimension-free fashion, even when the regularizer lacks strong convexity and smoothness. In addition, this linear convergence feature is provably stable in the face of inexact policy evaluation and imperfect policy updates. Numerical experiments are provided to corroborate the appealing performance of GPMD.  ( 2 min )
    Pruning Compact ConvNets for Efficient Inference. (arXiv:2301.04502v1 [cs.CV])
    Neural network pruning is frequently used to compress over-parameterized networks by large amounts, while incurring only marginal drops in generalization performance. However, the impact of pruning on networks that have been highly optimized for efficient inference has not received the same level of attention. In this paper, we analyze the effect of pruning for computer vision, and study state-of-the-art ConvNets, such as the FBNetV3 family of models. We show that model pruning approaches can be used to further optimize networks trained through NAS (Neural Architecture Search). The resulting family of pruned models can consistently obtain better performance than existing FBNetV3 models at the same level of computation, and thus provide state-of-the-art results when trading off between computational complexity and generalization performance on the ImageNet benchmark. In addition to better generalization performance, we also demonstrate that when limited computation resources are available, pruning FBNetV3 models incur only a fraction of GPU-hours involved in running a full-scale NAS.  ( 2 min )
    SemSup: Semantic Supervision for Simple and Scalable Zero-shot Generalization. (arXiv:2202.13100v3 [cs.LG] UPDATED)
    Zero-shot learning is the problem of predicting instances over classes not seen during training. One approach to zero-shot learning is providing auxiliary class information to the model. Prior works along this vein have largely used expensive per-instance annotation or singular class-level descriptions, but per-instance descriptions are hard to scale and single class descriptions may not be rich enough. Furthermore, these works have used natural-language descriptions exclusively, simple biencoders models, and modality or task specific methods. These approaches have several limitations: text supervision may not always be available or optimal and biencoders may only learn coarse relations between inputs and class descriptions. In this work, we present SemSup, a novel approach that uses (1) a scalable multiple description sampling method which improves performance over single descriptions, (2) alternative description formats such as JSON that are easy to generate and outperform text on certain settings, and (3) hybrid lexical-semantic similarity to leverage fine-grained information in class descriptions. We demonstrate the effectiveness of SemSup across four datasets, two modalities, and three generalization settings. For example, across text and image datasets, SemSup increases unseen class generalization accuracy by 15 points on average compared to the closest baseline.  ( 2 min )
    DA-MUSIC: Data-Driven DoA Estimation via Deep Augmented MUSIC Algorithm. (arXiv:2109.10581v5 [eess.SP] UPDATED)
    Direction of arrival (DoA) estimation of multiple signals is pivotal in sensor array signal processing. A popular multi-signal DoA estimation method is the multiple signal classification (MUSIC) algorithm, which enables high-performance super-resolution DoA recovery while being highly applicable in practice. MUSIC is a model-based algorithm, relying on an accurate mathematical description of the relationship between the signals and the measurements and assumptions on the signals themselves (non-coherent, narrowband sources). As such, it is sensitive to model imperfections. In this work we propose to overcome these limitations of MUSIC by augmenting the algorithm with specifically designed neural architectures. Our proposed deep augmented MUSIC (DA-MUSIC) algorithm is thus a hybrid model-based/data-driven DoA estimator, which leverages data to improve performance and robustness while preserving the interpretable flow of the classic method. DA-MUSIC is shown to learn to overcome limitations of the purely model-based method, such as its inability to successfully localize coherent sources as well as estimate the number of coherent signal sources present. We further demonstrate the superior resolution of the DA-MUSIC algorithm in synthetic narrowband and broadband scenarios as well as with real-world data of DoA estimation from seismic signals.  ( 2 min )
    Self-Supervised Learning for Biological Sample Localization in 3D Tomographic Images. (arXiv:2011.03353v2 [cs.CV] UPDATED)
    In synchrotron-based Computed Tomography (CT) there is a trade-off between spatial resolution, field of view and speed of positioning and alignment of samples. The problem is even more prominent for high-throughput tomography--an automated setup, capable of scanning large batches of samples without human interaction. As a result, in many applications, only 20-30% of the reconstructed volume contains the actual sample. Such data redundancy clutters the storage and increases processing time. Hence, an automated sample localization becomes an important practical problem. In this work, we describe two self-supervised losses designed for biological CT. We further demonstrate how to employ the uncertainty estimation for sample localization. This approach shows the ability to localize a sample with less than 1.5\% relative error and reduce the used storage by a factor of four. We also show that one of the proposed losses works reasonably well as a pre-training task for the semantic segmentation.  ( 2 min )
    Assessing the Early Bird Heuristic (for Predicting Project Quality). (arXiv:2105.11082v4 [cs.SE] UPDATED)
    Before researchers rush to reason across all available data or try complex methods, perhaps it is prudent to first check for simpler alternatives. Specifically, if the historical data has the most information in some small region, perhaps a model learned from that region would suffice for the rest of the project. To support this claim, we offer a case study with 240 projects, where we find that the information in those projects "clump" towards the earliest parts of the project. A quality prediction model learned from just the first 150 commits works as well, or better than state-of-the-art alternatives. Using just this "early bird" data, we can build models very quickly and very early in the project life cycle. Moreover, using this early bird method, we have shown that a simple model (with just a few features) generalizes to hundreds of projects. Based on this experience, we doubt that prior work on generalizing quality models may have needlessly complicated an inherently simple process. Further, prior work that focused on later-life cycle data needs to be revisited since their conclusions were drawn from relatively uninformative regions. Replication note: all our data and scripts are available here: https://github.com/snaraya7/early-bird  ( 2 min )
    Adversarial training with informed data selection. (arXiv:2301.04472v1 [cs.LG])
    With the increasing amount of available data and advances in computing capabilities, deep neural networks (DNNs) have been successfully employed to solve challenging tasks in various areas, including healthcare, climate, and finance. Nevertheless, state-of-the-art DNNs are susceptible to quasi-imperceptible perturbed versions of the original images -- adversarial examples. These perturbations of the network input can lead to disastrous implications in critical areas where wrong decisions can directly affect human lives. Adversarial training is the most efficient solution to defend the network against these malicious attacks. However, adversarial trained networks generally come with lower clean accuracy and higher computational complexity. This work proposes a data selection (DS) strategy to be applied in the mini-batch training. Based on the cross-entropy loss, the most relevant samples in the batch are selected to update the model parameters in the backpropagation. The simulation results show that a good compromise can be obtained regarding robustness and standard accuracy, whereas the computational complexity of the backpropagation pass is reduced.  ( 2 min )
    Perceive and predict: self-supervised speech representation based loss functions for speech enhancement. (arXiv:2301.04388v1 [cs.SD])
    Recent work in the domain of speech enhancement has explored the use of self-supervised speech representations to aid in the training of neural speech enhancement models. However, much of this work focuses on using the deepest or final outputs of self supervised speech representation models, rather than the earlier feature encodings. The use of self supervised representations in such a way is often not fully motivated. In this work it is shown that the distance between the feature encodings of clean and noisy speech correlate strongly with psychoacoustically motivated measures of speech quality and intelligibility, as well as with human Mean Opinion Score (MOS) ratings. Experiments using this distance as a loss function are performed and improved performance over the use of STFT spectrogram distance based loss as well as other common loss functions from speech enhancement literature is demonstrated using objective measures such as perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI).  ( 2 min )
    A Stochastic Optimization Framework for Fair Risk Minimization. (arXiv:2102.12586v4 [cs.LG] UPDATED)
    Despite the success of large-scale empirical risk minimization (ERM) at achieving high accuracy across a variety of machine learning tasks, fair ERM is hindered by the incompatibility of fairness constraints with stochastic optimization. We consider the problem of fair classification with discrete sensitive attributes and potentially large models and data sets, requiring stochastic solvers. Existing in-processing fairness algorithms are either impractical in the large-scale setting because they require large batches of data at each iteration or they are not guaranteed to converge. In this paper, we develop the first stochastic in-processing fairness algorithm with guaranteed convergence. For demographic parity, equalized odds, and equal opportunity notions of fairness, we provide slight variations of our algorithm--called FERMI--and prove that each of these variations converges in stochastic optimization with any batch size. Empirically, we show that FERMI is amenable to stochastic solvers with multiple (non-binary) sensitive attributes and non-binary targets, performing well even with minibatch size as small as one. Extensive experiments show that FERMI achieves the most favorable tradeoffs between fairness violation and test accuracy across all tested setups compared with state-of-the-art baselines for demographic parity, equalized odds, equal opportunity. These benefits are especially significant with small batch sizes and for non-binary classification with large number of sensitive attributes, making FERMI a practical fairness algorithm for large-scale problems.  ( 2 min )
    Trajectory Modeling via Random Utility Inverse Reinforcement Learning. (arXiv:2105.12092v2 [cs.AI] UPDATED)
    We consider the problem of modeling trajectories of drivers in a road network from the perspective of inverse reinforcement learning. Cars are detected by sensors placed on sparsely distributed points on the street network of a city. As rational agents, drivers are trying to maximize some reward function unknown to an external observer. We apply the concept of random utility from econometrics to model the unknown reward function as a function of observed and unobserved features. In contrast to current inverse reinforcement learning approaches, we do not assume that agents act according to a stochastic policy; rather, we assume that agents act according to a deterministic optimal policy and show that randomness in data arises because the exact rewards are not fully observed by an external observer. We introduce the concept of extended state to cope with unobserved features and develop a Markov decision process formulation of drivers decisions. We present theoretical results which guarantee the existence of solutions and show that maximum entropy inverse reinforcement learning is a particular case of our approach. Finally, we illustrate Bayesian inference on model parameters through a case study with real trajectory data from a large city in Brazil.  ( 2 min )
    Continual Few-Shot Learning Using HyperTransformers. (arXiv:2301.04584v1 [cs.LG])
    We focus on the problem of learning without forgetting from multiple tasks arriving sequentially, where each task is defined using a few-shot episode of novel or already seen classes. We approach this problem using the recently published HyperTransformer (HT), a Transformer-based hypernetwork that generates a specialized task-specific CNN weights directly from the support set. In order to learn from a continual sequence of task, we propose to recursively re-use the generated weights as input to the HT for the next task. This way, the generated CNN weights themselves act as a representation of previously learned tasks, and the HT is trained to update these weights so that the new task can be learned without forgetting past tasks. This approach is different from most continual learning algorithms that typically rely on using replay buffers, weight regularization or task-dependent architectural changes. We demonstrate that our proposed Continual HyperTransformer method equipped with a prototypical loss is capable of learning and retaining knowledge about past tasks for a variety of scenarios, including learning from mini-batches, and task-incremental and class-incremental learning scenarios.  ( 2 min )
    Learning fair representation with a parametric integral probability metric. (arXiv:2202.02943v4 [stat.ML] UPDATED)
    As they have a vital effect on social decision-making, AI algorithms should be not only accurate but also fair. Among various algorithms for fairness AI, learning fair representation (LFR), whose goal is to find a fair representation with respect to sensitive variables such as gender and race, has received much attention. For LFR, the adversarial training scheme is popularly employed as is done in the generative adversarial network type algorithms. The choice of a discriminator, however, is done heuristically without justification. In this paper, we propose a new adversarial training scheme for LFR, where the integral probability metric (IPM) with a specific parametric family of discriminators is used. The most notable result of the proposed LFR algorithm is its theoretical guarantee about the fairness of the final prediction model, which has not been considered yet. That is, we derive theoretical relations between the fairness of representation and the fairness of the prediction model built on the top of the representation (i.e., using the representation as the input). Moreover, by numerical experiments, we show that our proposed LFR algorithm is computationally lighter and more stable, and the final prediction model is competitive or superior to other LFR algorithms using more complex discriminators.  ( 2 min )
    Real-time simulation of viscoelastic tissue behavior with physics-guided deep learning. (arXiv:2301.04614v1 [cs.LG])
    Finite element methods (FEM) are popular approaches for simulation of soft tissues with elastic or viscoelastic behavior. However, their usage in real-time applications, such as in virtual reality surgical training, is limited by computational cost. In this application scenario, which typically involves transportable simulators, the computing hardware severely constrains the size or the level of details of the simulated scene. To address this limitation, data-driven approaches have been suggested to simulate mechanical deformations by learning the mapping rules from FEM generated datasets. Herein, we propose a deep learning method for predicting displacement fields of soft tissues with viscoelastic properties. The main contribution of this work is the use of a physics-guided loss function for the optimization of the deep learning model parameters. The proposed deep learning model is based on convolutional (CNN) and recurrent layers (LSTM) to predict spatiotemporal variations. It is augmented with a mass conservation law in the lost function to prevent the generation of physically inconsistent results. The deep learning model is trained on a set of FEM datasets that are generated from a commercially available state-of-the-art numerical neurosurgery simulator. The use of the physics-guided loss function in a deep learning model has led to a better generalization in the prediction of deformations in unseen simulation cases. Moreover, the proposed method achieves a better accuracy over the conventional CNN models, where improvements were observed in unseen tissue from 8% to 30% depending on the magnitude of external forces. It is hoped that the present investigation will help in filling the gap in applying deep learning in virtual reality simulators, hence improving their computational performance (compared to FEM simulations) and ultimately their usefulness.  ( 3 min )
    A Distinct Unsupervised Reference Model From The Environment Helps Continual Learning. (arXiv:2301.04506v1 [cs.LG])
    The existing continual learning methods are mainly focused on fully-supervised scenarios and are still not able to take advantage of unlabeled data available in the environment. Some recent works tried to investigate semi-supervised continual learning (SSCL) settings in which the unlabeled data are available, but it is only from the same distribution as the labeled data. This assumption is still not general enough for real-world applications and restricts the utilization of unsupervised data. In this work, we introduce Open-Set Semi-Supervised Continual Learning (OSSCL), a more realistic semi-supervised continual learning setting in which out-of-distribution (OoD) unlabeled samples in the environment are assumed to coexist with the in-distribution ones. Under this configuration, we present a model with two distinct parts: (i) the reference network captures general-purpose and task-agnostic knowledge in the environment by using a broad spectrum of unlabeled samples, (ii) the learner network is designed to learn task-specific representations by exploiting supervised samples. The reference model both provides a pivotal representation space and also segregates unlabeled data to exploit them more efficiently. By performing a diverse range of experiments, we show the superior performance of our model compared with other competitors and prove the effectiveness of each component of the proposed model.  ( 2 min )
    Exploring the Latent Space of Autoencoders with Interventional Assays. (arXiv:2106.16091v4 [cs.LG] UPDATED)
    Autoencoders exhibit impressive abilities to embed the data manifold into a low-dimensional latent space, making them a staple of representation learning methods. However, without explicit supervision, which is often unavailable, the representation is usually uninterpretable, making analysis and principled progress challenging. We propose a framework, called latent responses, which exploits the locally contractive behavior exhibited by variational autoencoders to explore the learned manifold. More specifically, we develop tools to probe the representation using interventions in the latent space to quantify the relationships between latent variables. We extend the notion of disentanglement to take the learned generative process into account and consequently avoid the limitations of existing metrics that may rely on spurious correlations. Our analyses underscore the importance of studying the causal structure of the representation to improve performance on downstream tasks such as generation, interpolation, and inference of the factors of variation.  ( 2 min )
    Uncertainty Estimation based on Geometric Separation. (arXiv:2301.04452v1 [cs.LG])
    In machine learning, accurately predicting the probability that a specific input is correct is crucial for risk management. This process, known as uncertainty (or confidence) estimation, is particularly important in mission-critical applications such as autonomous driving. In this work, we put forward a novel geometric-based approach for improving uncertainty estimations in machine learning models. Our approach involves using the geometric distance of the current input from existing training inputs as a signal for estimating uncertainty, and then calibrating this signal using standard post-hoc techniques. We demonstrate that our method leads to more accurate uncertainty estimations than recently proposed approaches through extensive evaluation on a variety of datasets and models. Additionally, we optimize our approach so that it can be implemented on large datasets in near real-time applications, making it suitable for time-sensitive scenarios.  ( 2 min )
    Speech Driven Video Editing via an Audio-Conditioned Diffusion Model. (arXiv:2301.04474v1 [cs.CV])
    In this paper we propose a method for end-to-end speech driven video editing using a denoising diffusion model. Given a video of a person speaking, we aim to re-synchronise the lip and jaw motion of the person in response to a separate auditory speech recording without relying on intermediate structural representations such as facial landmarks or a 3D face model. We show this is possible by conditioning a denoising diffusion model with audio spectral features to generate synchronised facial motion. We achieve convincing results on the task of unstructured single-speaker video editing, achieving a word error rate of 45% using an off the shelf lip reading model. We further demonstrate how our approach can be extended to the multi-speaker domain. To our knowledge, this is the first work to explore the feasibility of applying denoising diffusion models to the task of audio-driven video editing.  ( 2 min )
    Federated Learning under Heterogeneous and Correlated Client Availability. (arXiv:2301.04632v1 [cs.LG])
    The enormous amount of data produced by mobile and IoT devices has motivated the development of federated learning (FL), a framework allowing such devices (or clients) to collaboratively train machine learning models without sharing their local data. FL algorithms (like FedAvg) iteratively aggregate model updates computed by clients on their own datasets. Clients may exhibit different levels of participation, often correlated over time and with other clients. This paper presents the first convergence analysis for a FedAvg-like FL algorithm under heterogeneous and correlated client availability. Our analysis highlights how correlation adversely affects the algorithm's convergence rate and how the aggregation strategy can alleviate this effect at the cost of steering training toward a biased model. Guided by the theoretical analysis, we propose CA-Fed, a new FL algorithm that tries to balance the conflicting goals of maximizing convergence speed and minimizing model bias. To this purpose, CA-Fed dynamically adapts the weight given to each client and may ignore clients with low availability and large correlation. Our experimental results show that CA-Fed achieves higher time-average accuracy and a lower standard deviation than state-of-the-art AdaFed and F3AST, both on synthetic and real datasets.  ( 2 min )
    Determinate Node Selection for Semi-supervised Classification Oriented Graph Convolutional Networks. (arXiv:2301.04381v1 [cs.LG])
    Graph Convolutional Networks (GCNs) have been proved successful in the field of semi-supervised node classification by extracting structural information from graph data. However, the random selection of labeled nodes used by GCNs may lead to unstable generalization performance of GCNs. In this paper, we propose an efficient method for the deterministic selection of labeled nodes: the Determinate Node Selection (DNS) algorithm. The DNS algorithm identifies two categories of representative nodes in the graph: typical nodes and divergent nodes. These labeled nodes are selected by exploring the structure of the graph and determining the ability of the nodes to represent the distribution of data within the graph. The DNS algorithm can be applied quite simply on a wide range of semi-supervised graph neural network models for node classification tasks. Through extensive experimentation, we have demonstrated that the incorporation of the DNS algorithm leads to a remarkable improvement in the average accuracy of the model and a significant decrease in the standard deviation, as compared to the original method.  ( 2 min )
    Exploring the Approximation Capabilities of Multiplicative Neural Networks for Smooth Functions. (arXiv:2301.04605v1 [cs.LG])
    Multiplication layers are a key component in various influential neural network modules, including self-attention and hypernetwork layers. In this paper, we investigate the approximation capabilities of deep neural networks with intermediate neurons connected by simple multiplication operations. We consider two classes of target functions: generalized bandlimited functions, which are frequently used to model real-world signals with finite bandwidth, and Sobolev-Type balls, which are embedded in the Sobolev Space $\mathcal{W}^{r,2}$. Our results demonstrate that multiplicative neural networks can approximate these functions with significantly fewer layers and neurons compared to standard ReLU neural networks, with respect to both input dimension and approximation error. These findings suggest that multiplicative gates can outperform standard feed-forward layers and have potential for improving neural network design.  ( 2 min )
    Dynamics of a data-driven low-dimensional model of turbulent minimal Couette flow. (arXiv:2301.04638v1 [physics.flu-dyn])
    Because the Navier-Stokes equations are dissipative, the long-time dynamics of a flow in state space are expected to collapse onto a manifold whose dimension may be much lower than the dimension required for a resolved simulation. On this manifold, the state of the system can be exactly described in a coordinate system parameterizing the manifold. Describing the system in this low-dimensional coordinate system allows for much faster simulations and analysis. We show, for turbulent Couette flow, that this description of the dynamics is possible using a data-driven manifold dynamics modeling method. This approach consists of an autoencoder to find a low-dimensional manifold coordinate system and a set of ordinary differential equations defined by a neural network. Specifically, we apply this method to minimal flow unit turbulent plane Couette flow at $\textit{Re}=400$, where a fully resolved solutions requires $\mathcal{O}(10^5)$ degrees of freedom. Using only data from this simulation we build models with fewer than $20$ degrees of freedom that quantitatively capture key characteristics of the flow, including the streak breakdown and regeneration cycle. At short-times, the models track the true trajectory for multiple Lyapunov times, and, at long-times, the models capture the Reynolds stress and the energy balance. For comparison, we show that the models outperform POD-Galerkin models with $\sim$2000 degrees of freedom. Finally, we compute unstable periodic orbits from the models. Many of these closely resemble previously computed orbits for the full system; additionally, we find nine orbits that correspond to previously unknown solutions in the full system.  ( 2 min )
    Fast conformational clustering of extensive molecular dynamics simulation data. (arXiv:2301.04492v1 [physics.chem-ph])
    We present an unsupervised data processing workflow that is specifically designed to obtain a fast conformational clustering of long molecular dynamics simulation trajectories. In this approach we combine two dimensionality reduction algorithms (cc\_analysis and encodermap) with a density-based spatial clustering algorithm (HDBSCAN). The proposed scheme benefits from the strengths of the three algorithms while avoiding most of the drawbacks of the individual methods. Here the cc\_analysis algorithm is for the first time applied to molecular simulation data. Encodermap complements cc\_analysis by providing an efficient way to process and assign large amounts of data to clusters. The main goal of the procedure is to maximize the number of assigned frames of a given trajectory, while keeping a clear conformational identity of the clusters that are found. In practice we achieve this by using an iterative clustering approach and a tunable root-mean-square-deviation-based criterion in the final cluster assignment. This allows to find clusters of different densities as well as different degrees of structural identity. With the help of four test systems we illustrate the capability and performance of this clustering workflow: wild-type and thermostable mutant of the Trp-cage protein (TC5b and TC10b), NTL9 and Protein B. Each of these systems poses individual challenges to the scheme, which in total give a nice overview of the advantages, as well as potential difficulties that can arise when using the proposed method.  ( 3 min )
    Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering. (arXiv:2301.04366v1 [cs.CL])
    We present a new pre-training method, Multimodal Inverse Cloze Task, for Knowledge-based Visual Question Answering about named Entities (KVQAE). KVQAE is a recently introduced task that consists in answering questions about named entities grounded in a visual context using a Knowledge Base. Therefore, the interaction between the modalities is paramount to retrieve information and must be captured with complex fusion models. As these models require a lot of training data, we design this pre-training task from existing work in textual Question Answering. It consists in considering a sentence as a pseudo-question and its context as a pseudo-relevant passage and is extended by considering images near texts in multimodal documents. Our method is applicable to different neural network architectures and leads to a 9% relative-MRR and 15% relative-F1 gain for retrieval and reading comprehension, respectively, over a no-pre-training baseline.  ( 2 min )
    Federated Learning and Blockchain-enabled Fog-IoT Platform for Wearables in Predictive Healthcare. (arXiv:2301.04511v1 [cs.LG])
    Over the years, the popularity and usage of wearable Internet of Things (IoT) devices in several healthcare services are increased. Among the services that benefit from the usage of such devices is predictive analysis, which can improve early diagnosis in e-health. However, due to the limitations of wearable IoT devices, challenges in data privacy, service integrity, and network structure adaptability arose. To address these concerns, we propose a platform using federated learning and private blockchain technology within a fog-IoT network. These technologies have privacy-preserving features securing data within the network. We utilized the fog-IoT network's distributive structure to create an adaptive network for wearable IoT devices. We designed a testbed to examine the proposed platform's ability to preserve the integrity of a classifier. According to experimental results, the introduced implementation can effectively preserve a patient's privacy and a predictive service's integrity. We further investigated the contributions of other technologies to the security and adaptability of the IoT network. Overall, we proved the feasibility of our platform in addressing significant security and privacy challenges of wearable IoT devices in predictive healthcare through analysis, simulation, and experimentation.  ( 2 min )
    BINN: A deep learning approach for computational mechanics problems based on boundary integral equations. (arXiv:2301.04480v1 [cs.LG])
    We proposed the boundary-integral type neural networks (BINN) for the boundary value problems in computational mechanics. The boundary integral equations are employed to transfer all the unknowns to the boundary, then the unknowns are approximated using neural networks and solved through a training process. The loss function is chosen as the residuals of the boundary integral equations. Regularization techniques are adopted to efficiently evaluate the weakly singular and Cauchy principle integrals in boundary integral equations. Potential problems and elastostatic problems are mainly concerned in this article as a demonstration. The proposed method has several outstanding advantages: First, the dimensions of the original problem are reduced by one, thus the freedoms are greatly reduced. Second, the proposed method does not require any extra treatment to introduce the boundary conditions, since they are naturally considered through the boundary integral equations. Therefore, the method is suitable for complex geometries. Third, BINN is suitable for problems on the infinite or semi-infinite domains. Moreover, BINN can easily handle heterogeneous problems with a single neural network without domain decomposition.  ( 2 min )
    A prediction and behavioural analysis of machine learning methods for modelling travel mode choice. (arXiv:2301.04404v1 [cs.LG])
    The emergence of a variety of Machine Learning (ML) approaches for travel mode choice prediction poses an interesting question to transport modellers: which models should be used for which applications? The answer to this question goes beyond simple predictive performance, and is instead a balance of many factors, including behavioural interpretability and explainability, computational complexity, and data efficiency. There is a growing body of research which attempts to compare the predictive performance of different ML classifiers with classical random utility models. However, existing studies typically analyse only the disaggregate predictive performance, ignoring other aspects affecting model choice. Furthermore, many studies are affected by technical limitations, such as the use of inappropriate validation schemes, incorrect sampling for hierarchical data, lack of external validation, and the exclusive use of discrete metrics. We address these limitations by conducting a systematic comparison of different modelling approaches, across multiple modelling problems, in terms of the key factors likely to affect model choice (out-of-sample predictive performance, accuracy of predicted market shares, extraction of behavioural indicators, and computational efficiency). We combine several real world datasets with synthetic datasets, where the data generation function is known. The results indicate that the models with the highest disaggregate predictive performance (namely extreme gradient boosting and random forests) provide poorer estimates of behavioural indicators and aggregate mode shares, and are more expensive to estimate, than other models, including deep neural networks and Multinomial Logit (MNL). It is further observed that the MNL model performs robustly in a variety of situations, though ML techniques can improve the estimates of behavioural indices such as Willingness to Pay.  ( 2 min )
    Rethinking complex-valued deep neural networks for monaural speech enhancement. (arXiv:2301.04320v1 [cs.SD])
    Despite multiple efforts made towards adopting complex-valued deep neural networks (DNNs), it remains an open question whether complex-valued DNNs are generally more effective than real-valued DNNs for monaural speech enhancement. This work is devoted to presenting a critical assessment by systematically examining complex-valued DNNs against their real-valued counterparts. Specifically, we investigate complex-valued DNN atomic units, including linear layers, convolutional layers, long short-term memory (LSTM), and gated linear units. By comparing complex- and real-valued versions of fundamental building blocks in the recently developed gated convolutional recurrent network (GCRN), we show how different mechanisms for basic blocks affect the performance. We also find that the use of complex-valued operations hinders the model capacity when the model size is small. In addition, we examine two recent complex-valued DNNs, i.e. deep complex convolutional recurrent network (DCCRN) and deep complex U-Net (DCUNET). Evaluation results show that both DNNs produce identical performance to their real-valued counterparts while requiring much more computation. Based on these comprehensive comparisons, we conclude that complex-valued DNNs do not provide a performance gain over their real-valued counterparts for monaural speech enhancement, and thus are less desirable due to their higher computational costs.  ( 2 min )
    On the functional form of the radial acceleration relation. (arXiv:2301.04368v1 [astro-ph.GA])
    We apply a new method for learning equations from data -- Exhaustive Symbolic Regression (ESR) -- to late-type galaxy dynamics as encapsulated in the radial acceleration relation (RAR). Relating the centripetal acceleration due to baryons, $g_\text{bar}$, to the total dynamical acceleration, $g_\text{obs}$, the RAR has been claimed to manifest a new law of nature due to its regularity and tightness, in agreement with Modified Newtonian Dynamics (MOND). Fits to this relation have been restricted by prior expectations to particular functional forms, while ESR affords an exhaustive and nearly prior-free search through functional parameter space to identify the equations optimally trading accuracy with simplicity. Working with the SPARC data, we find the best functions typically satisfy $g_\text{obs} \propto g_\text{bar}$ at high $g_\text{bar}$, although the coefficient of proportionality is not clearly unity and the deep-MOND limit $g_\text{obs} \propto \sqrt{g_\text{bar}}$ as $g_\text{bar} \to 0$ is little evident at all. By generating mock data according to MOND with or without the external field effect, we find that symbolic regression would not be expected to identify the generating function or reconstruct successfully the asymptotic slopes. We conclude that the limited dynamical range and significant uncertainties of the SPARC RAR preclude a definitive statement of its functional form, and hence that this data alone can neither demonstrate nor rule out law-like gravitational behaviour.  ( 2 min )
    Dataset of Fluorescence Spectra and Chemical Parameters of Olive Oils. (arXiv:2301.04471v1 [q-bio.QM])
    This dataset encompasses fluorescence spectra and chemical parameters of 24 olive oil samples from the 2019-2020 harvest provided by the producer Conde de Benalua, Granada, Spain. The oils are characterized by different qualities: 10 extra virgin olive oil (EVOO), 8 virgin olive oil (VOO), and 6 lampante olive oil (LOO) samples. For each sample, the dataset includes fluorescence spectra obtained with two excitation wavelengths, oil quality, and five chemical parameters necessary for the quality assessment of olive oil. The fluorescence spectra were obtained by exciting the samples at 365 nm and 395 nm under identical conditions. The dataset includes the values of the following chemical parameters for each olive oil sample: acidity, peroxide value, K270, K232, ethyl esters, and the quality of the samples (EVOO, VOO, or LOO). The dataset offers a unique possibility for researchers in food technology to develop machine learning models based on fluorescence data for the quality assessment of olive oil due to the availability of both spectroscopic and chemical data. The dataset can be used, for example, to predict one or multiple chemical parameters or to classify samples based on their quality from fluorescence spectra.  ( 2 min )
    Heterogeneous Tri-stream Clustering Network. (arXiv:2301.04451v1 [cs.LG])
    Contrastive deep clustering has recently gained significant attention with its ability of joint contrastive learning and clustering via deep neural networks. Despite the rapid progress, previous works mostly require both positive and negative sample pairs for contrastive clustering, which rely on a relative large batch-size. Moreover, they typically adopt a two-stream architecture with two augmented views, which overlook the possibility and potential benefits of multi-stream architectures (especially with heterogeneous or hybrid networks). In light of this, this paper presents a new end-to-end deep clustering approach termed Heterogeneous Tri-stream Clustering Network (HTCN). The tri-stream architecture in HTCN consists of three main components, including two weight-sharing online networks and a target network, where the parameters of the target network are the exponential moving average of that of the online networks. Notably, the two online networks are trained by simultaneously (i) predicting the instance representations of the target network and (ii) enforcing the consistency between the cluster representations of the target network and that of the two online networks. Experimental results on four challenging image datasets demonstrate the superiority of HTCN over the state-of-the-art deep clustering approaches. The code is available at https://github.com/dengxiaozhi/HTCN.  ( 2 min )
    WuYun: Exploring hierarchical skeleton-guided melody generation using knowledge-enhanced deep learning. (arXiv:2301.04488v1 [cs.SD])
    Although deep learning has revolutionized music generation, existing methods for structured melody generation follow an end-to-end left-to-right note-by-note generative paradigm and treat each note equally. Here, we present WuYun, a knowledge-enhanced deep learning architecture for improving the structure of generated melodies, which first generates the most structurally important notes to construct a melodic skeleton and subsequently infills it with dynamically decorative notes into a full-fledged melody. Specifically, we use music domain knowledge to extract melodic skeletons and employ sequence learning to reconstruct them, which serve as additional knowledge to provide auxiliary guidance for the melody generation process. We demonstrate that WuYun can generate melodies with better long-term structure and musicality and outperforms other state-of-the-art methods by 0.51 on average on all subjective evaluation metrics. Our study provides a multidisciplinary lens to design melodic hierarchical structures and bridge the gap between data-driven and knowledge-based approaches for numerous music generation tasks.  ( 2 min )
    A Meta Path-based Approach for Rumor Detection on Social Media. (arXiv:2301.04341v1 [cs.SI])
    The prominent role of social media in people's daily lives has made them more inclined to receive news through social networks than traditional sources. This shift in public behavior has opened doors for some to diffuse fake news on social media; and subsequently cause negative economic, political, and social consequences as well as distrust among the public. There are many proposed methods to solve the rumor detection problem, most of which do not take full advantage of the heterogeneous nature of news propagation networks. With this intention, we considered a previously proposed architecture as our baseline and performed the idea of structural feature extraction from the heterogeneous rumor propagation over its architecture using the concept of meta path-based embeddings. We named our model Meta Path-based Global Local Attention Network (MGLAN). Extensive experimental analysis on three state-of-the-art datasets has demonstrated that MGLAN outperforms other models by capturing node-level discrimination to different node types.  ( 2 min )
    Combining Self-labeling with Selective Sampling. (arXiv:2301.04420v1 [cs.LG])
    Since data is the fuel that drives machine learning models, and access to labeled data is generally expensive, semi-supervised methods are constantly popular. They enable the acquisition of large datasets without the need for too many expert labels. This work combines self-labeling techniques with active learning in a selective sampling scenario. We propose a new method that builds an ensemble classifier. Based on an evaluation of the inconsistency of the decisions of the individual base classifiers for a given observation, a decision is made on whether to request a new label or use the self-labeling. In preliminary studies, we show that naive application of self-labeling can harm performance by introducing bias towards selected classes and consequently lead to skewed class distribution. Hence, we also propose mechanisms to reduce this phenomenon. Experimental evaluation shows that the proposed method matches current selective sampling methods or achieves better results.  ( 2 min )
    VS-Net: Multiscale Spatiotemporal Features for Lightweight Video Salient Document Detection. (arXiv:2301.04447v1 [cs.CV])
    Video Salient Document Detection (VSDD) is an essential task of practical computer vision, which aims to highlight visually salient document regions in video frames. Previous techniques for VSDD focus on learning features without considering the cooperation among and across the appearance and motion cues and thus fail to perform in practical scenarios. Moreover, most of the previous techniques demand high computational resources, which limits the usage of such systems in resource-constrained settings. To handle these issues, we propose VS-Net, which captures multi-scale spatiotemporal information with the help of dilated depth-wise separable convolution and Approximation Rank Pooling. VS-Net extracts the key features locally from each frame across embedding sub-spaces and forwards the features between adjacent and parallel nodes, enhancing model performance globally. Our model generates saliency maps considering both the background and foreground simultaneously, making it perform better in challenging scenarios. The immense experiments regulated on the benchmark MIDV-500 dataset show that the VS-Net model outperforms state-of-the-art approaches in both time and robustness measures.  ( 2 min )
    Loss-Controlling Calibration for Predictive Models. (arXiv:2301.04378v1 [cs.LG])
    We propose a learning framework for calibrating predictive models to make loss-controlling prediction for exchangeable data, which extends our recently proposed conformal loss-controlling prediction for more general cases. By comparison, the predictors built by the proposed loss-controlling approach are not limited to set predictors, and the loss function can be any measurable function without the monotone assumption. To control the loss values in an efficient way, we introduce transformations preserving exchangeability to prove finite-sample controlling guarantee when the test label is obtained, and then develop an approximation approach to construct predictors. The transformations can be built on any predefined function, which include using optimization algorithms for parameter searching. This approach is a natural extension of conformal loss-controlling prediction, since it can be reduced to the latter when the set predictors have the nesting property and the loss functions are monotone. Our proposed method is tested empirically for high-impact weather forecasting and the experimental results demonstrate its effectiveness for controlling the non-monotone loss related to false discovery.  ( 2 min )
    An Analysis of Quantile Temporal-Difference Learning. (arXiv:2301.04462v1 [cs.LG])
    We analyse quantile temporal-difference learning (QTD), a distributional reinforcement learning algorithm that has proven to be a key component in several successful large-scale applications of reinforcement learning. Despite these empirical successes, a theoretical understanding of QTD has proven elusive until now. Unlike classical TD learning, which can be analysed with standard stochastic approximation tools, QTD updates do not approximate contraction mappings, are highly non-linear, and may have multiple fixed points. The core result of this paper is a proof of convergence to the fixed points of a related family of dynamic programming procedures with probability 1, putting QTD on firm theoretical footing. The proof establishes connections between QTD and non-linear differential inclusions through stochastic approximation theory and non-smooth analysis.  ( 2 min )
    Multiple-level Point Embedding for Solving Human Trajectory Imputation with Prediction. (arXiv:2301.04482v1 [cs.LG])
    Sparsity is a common issue in many trajectory datasets, including human mobility data. This issue frequently brings more difficulty to relevant learning tasks, such as trajectory imputation and prediction. Nowadays, little existing work simultaneously deals with imputation and prediction on human trajectories. This work plans to explore whether the learning process of imputation and prediction could benefit from each other to achieve better outcomes. And the question will be answered by studying the coexistence patterns between missing points and observed ones in incomplete trajectories. More specifically, the proposed model develops an imputation component based on the self-attention mechanism to capture the coexistence patterns between observations and missing points among encoder-decoder layers. Meanwhile, a recurrent unit is integrated to extract the sequential embeddings from newly imputed sequences for predicting the following location. Furthermore, a new implementation called Imputation Cycle is introduced to enable gradual imputation with prediction enhancement at multiple levels, which helps to accelerate the speed of convergence. The experimental results on three different real-world mobility datasets show that the proposed approach has significant advantages over the competitive baselines across both imputation and prediction tasks in terms of accuracy and stability.  ( 2 min )
    Network Adaptive Federated Learning: Congestion and Lossy Compression. (arXiv:2301.04430v1 [cs.LG])
    In order to achieve the dual goals of privacy and learning across distributed data, Federated Learning (FL) systems rely on frequent exchanges of large files (model updates) between a set of clients and the server. As such FL systems are exposed to, or indeed the cause of, congestion across a wide set of network resources. Lossy compression can be used to reduce the size of exchanged files and associated delays, at the cost of adding noise to model updates. By judiciously adapting clients' compression to varying network congestion, an FL application can reduce wall clock training time. To that end, we propose a Network Adaptive Compression (NAC-FL) policy, which dynamically varies the client's lossy compression choices to network congestion variations. We prove, under appropriate assumptions, that NAC-FL is asymptotically optimal in terms of directly minimizing the expected wall clock training time. Further, we show via simulation that NAC-FL achieves robust performance improvements with higher gains in settings with positively correlated delays across time.  ( 2 min )
    SoK: Adversarial Machine Learning Attacks and Defences in Multi-Agent Reinforcement Learning. (arXiv:2301.04299v1 [cs.LG])
    Multi-Agent Reinforcement Learning (MARL) is vulnerable to Adversarial Machine Learning (AML) attacks and needs adequate defences before it can be used in real world applications. We have conducted a survey into the use of execution-time AML attacks against MARL and the defences against those attacks. We surveyed related work in the application of AML in Deep Reinforcement Learning (DRL) and Multi-Agent Learning (MAL) to inform our analysis of AML for MARL. We propose a novel perspective to understand the manner of perpetrating an AML attack, by defining Attack Vectors. We develop two new frameworks to address a gap in current modelling frameworks, focusing on the means and tempo of an AML attack against MARL, and identify knowledge gaps and future avenues of research.  ( 2 min )
    Robust Bayesian Target Value Optimization. (arXiv:2301.04344v1 [cs.LG])
    We consider the problem of finding an input to a stochastic black box function such that the scalar output of the black box function is as close as possible to a target value in the sense of the expected squared error. While the optimization of stochastic black boxes is classic in (robust) Bayesian optimization, the current approaches based on Gaussian processes predominantly focus either on i) maximization/minimization rather than target value optimization or ii) on the expectation, but not the variance of the output, ignoring output variations due to stochasticity in uncontrollable environmental variables. In this work, we fill this gap and derive acquisition functions for common criteria such as the expected improvement, the probability of improvement, and the lower confidence bound, assuming that aleatoric effects are Gaussian with known variance. Our experiments illustrate that this setting is compatible with certain extensions of Gaussian processes, and show that the thus derived acquisition functions can outperform classical Bayesian optimization even if the latter assumptions are violated. An industrial use case in billet forging is presented.  ( 2 min )
    Beyond Graph Convolutional Network: An Interpretable Regularizer-centered Optimization Framework. (arXiv:2301.04318v1 [cs.LG])
    Graph convolutional networks (GCNs) have been attracting widespread attentions due to their encouraging performance and powerful generalizations. However, few work provide a general view to interpret various GCNs and guide GCNs' designs. In this paper, by revisiting the original GCN, we induce an interpretable regularizer-centerd optimization framework, in which by building appropriate regularizers we can interpret most GCNs, such as APPNP, JKNet, DAGNN, and GNN-LF/HF. Further, under the proposed framework, we devise a dual-regularizer graph convolutional network (dubbed tsGCN) to capture topological and semantic structures from graph data. Since the derived learning rule for tsGCN contains an inverse of a large matrix and thus is time-consuming, we leverage the Woodbury matrix identity and low-rank approximation tricks to successfully decrease the high computational complexity of computing infinite-order graph convolutions. Extensive experiments on eight public datasets demonstrate that tsGCN achieves superior performance against quite a few state-of-the-art competitors w.r.t. classification tasks.  ( 2 min )
    Learnable Path in Neural Controlled Differential Equations. (arXiv:2301.04333v1 [cs.LG])
    Neural controlled differential equations (NCDEs), which are continuous analogues to recurrent neural networks (RNNs), are a specialized model in (irregular) time-series processing. In comparison with similar models, e.g., neural ordinary differential equations (NODEs), the key distinctive characteristics of NCDEs are i) the adoption of the continuous path created by an interpolation algorithm from each raw discrete time-series sample and ii) the adoption of the Riemann--Stieltjes integral. It is the continuous path which makes NCDEs be analogues to continuous RNNs. However, NCDEs use existing interpolation algorithms to create the path, which is unclear whether they can create an optimal path. To this end, we present a method to generate another latent path (rather than relying on existing interpolation algorithms), which is identical to learning an appropriate interpolation method. We design an encoder-decoder module based on NCDEs and NODEs, and a special training method for it. Our method shows the best performance in both time-series classification and forecasting.  ( 2 min )
    Synthetic data generation method for data-free knowledge distillation in regression neural networks. (arXiv:2301.04338v1 [cs.LG])
    Knowledge distillation is the technique of compressing a larger neural network, known as the teacher, into a smaller neural network, known as the student, while still trying to maintain the performance of the larger neural network as much as possible. Existing methods of knowledge distillation are mostly applicable for classification tasks. Many of them also require access to the data used to train the teacher model. To address the problem of knowledge distillation for regression tasks under the absence of original training data, previous work has proposed a data-free knowledge distillation method where synthetic data are generated using a generator model trained adversarially against the student model. These synthetic data and their labels predicted by the teacher model are then used to train the student model. In this study, we investigate the behavior of various synthetic data generation methods and propose a new synthetic data generation strategy that directly optimizes for a large but bounded difference between the student and teacher model. Our results on benchmark and case study experiments demonstrate that the proposed strategy allows the student model to learn better and emulate the performance of the teacher model more closely.  ( 2 min )
    Application of machine learning to gas flaring. (arXiv:2301.04141v1 [cs.LG])
    Currently in the petroleum industry, operators often flare the produced gas instead of commodifying it. The flaring magnitudes are large in some states, which constitute problems with energy waste and CO2 emissions. In North Dakota, operators are required to estimate and report the volume flared. The questions are, how good is the quality of this reporting, and what insights can be drawn from it? Apart from the company-reported statistics, which are available from the North Dakota Industrial Commission (NDIC), flared volumes can be estimated via satellite remote sensing, serving as an unbiased benchmark. Since interpretation of the Landsat 8 imagery is hindered by artifacts due to glow, the estimated volumes based on the Visible Infrared Imaging Radiometer Suite (VIIRS) are used. Reverse geocoding is performed for comparing and contrasting the NDIC and VIIRS data at different levels, such as county and oilfield. With all the data gathered and preprocessed, Bayesian learning implemented by MCMC methods is performed to address three problems: county level model development, flaring time series analytics, and distribution estimation. First, there is heterogeneity among the different counties, in the associations between the NDIC and VIIRS volumes. In light of such, models are developed for each county by exploiting hierarchical models. Second, the flaring time series, albeit noisy, contains information regarding trends and patterns, which provide some insights into operator approaches. Gaussian processes are found to be effective in many different pattern recognition scenarios. Third, distributional insights are obtained through unsupervised learning. The negative binomial and GMMs are found to effectively describe the oilfield flare count and flared volume distributions, respectively. Finally, a nearest-neighbor-based approach for operator level monitoring and analytics is introduced.  ( 2 min )
    Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models. (arXiv:2301.04213v1 [cs.LG])
    Language models are known to learn a great quantity of factual information during pretraining, and recent work localizes this information to specific model weights like mid-layer MLP weights (Meng et al., 2022). In this paper, we find that we can change how a fact is stored in a model by editing weights that are in a different location than where existing methods suggest that the fact is stored. This is surprising because we would expect that localizing facts to specific parameters in models would tell us where to manipulate knowledge in models, and this assumption has motivated past work on model editing methods. Specifically, we show that localization conclusions from representation denoising (also known as Causal Tracing) do not provide any insight into which model MLP layer would be best to edit in order to override an existing stored fact with a new one. This finding raises questions about how past work relies on Causal Tracing to select which model layers to edit (Meng et al., 2022). Next, to better understand the discrepancy between representation denoising and weight editing, we develop several variants of the editing problem that appear more and more like representation denoising in their design and objective. Experiments show that, for one of our editing problems, editing performance does relate to localization results from representation denoising, but we find that which layer we edit is a far better predictor of performance. Our results suggest, counterintuitively, that better mechanistic understanding of how pretrained language models work may not always translate to insights about how to best change their behavior. Code is available at: https://github.com/google/belief-localization  ( 2 min )
    Age of Information in Deep Learning-Driven Task-Oriented Communications. (arXiv:2301.04298v1 [cs.IT])
    This paper studies the notion of age in task-oriented communications that aims to execute a task at a receiver utilizing the data at its transmitter. The transmitter-receiver operations are modeled as an encoder-decoder pair of deep neural networks (DNNs) that are jointly trained while considering channel effects. The encoder converts data samples into feature vectors of small dimension and transmits them with a small number of channel uses thereby reducing the number of transmissions and latency. Instead of reconstructing input samples, the decoder performs a task, e.g., classification, on the received signals. Applying different DNNs on MNIST and CIFAR-10 image data, the classifier accuracy is shown to increase with the number of channel uses at the expense of longer service time. The peak age of task information (PAoTI) is introduced to analyze this accuracy-latency tradeoff when the age grows unless a received signal is classified correctly. By incorporating channel and traffic effects, design guidelines are obtained for task-oriented communications by characterizing how the PAoTI first decreases and then increases with the number of channels uses. A dynamic update mechanism is presented to adapt the number of channel uses to channel and traffic conditions, and reduce the PAoTI in task-oriented communications.  ( 2 min )
    Data Distillation: A Survey. (arXiv:2301.04272v1 [cs.LG])
    The popularity of deep learning has led to the curation of a vast number of massive and multifarious datasets. Despite having close-to-human performance on individual tasks, training parameter-hungry models on large datasets poses multi-faceted problems such as (a) high model-training time; (b) slow research iteration; and (c) poor eco-sustainability. As an alternative, data distillation approaches aim to synthesize terse data summaries, which can serve as effective drop-in replacements of the original dataset for scenarios like model training, inference, architecture search, etc. In this survey, we present a formal framework for data distillation, along with providing a detailed taxonomy of existing approaches. Additionally, we cover data distillation approaches for different data modalities, namely images, graphs, and user-item interactions (recommender systems), while also identifying current challenges and future research directions.  ( 2 min )
    Pix2Map: Cross-modal Retrieval for Inferring Street Maps from Images. (arXiv:2301.04224v1 [cs.CV])
    Self-driving vehicles rely on urban street maps for autonomous navigation. In this paper, we introduce Pix2Map, a method for inferring urban street map topology directly from ego-view images, as needed to continually update and expand existing maps. This is a challenging task, as we need to infer a complex urban road topology directly from raw image data. The main insight of this paper is that this problem can be posed as cross-modal retrieval by learning a joint, cross-modal embedding space for images and existing maps, represented as discrete graphs that encode the topological layout of the visual surroundings. We conduct our experimental evaluation using the Argoverse dataset and show that it is indeed possible to accurately retrieve street maps corresponding to both seen and unseen roads solely from image data. Moreover, we show that our retrieved maps can be used to update or expand existing maps and even show proof-of-concept results for visual localization and image retrieval from spatial graphs.  ( 2 min )
    A Possible Converter to Denoise the Images of Exoplanet Candidates through Machine Learning Techniques. (arXiv:2301.04292v1 [astro-ph.EP])
    The method of direct imaging has detected many exoplanets and made important contribution to the field of planet formation. The standard method employs angular differential imaging (ADI) technique, and more ADI image frames could lead to the results with larger signal-to-noise-ratio (SNR). However, it would need precious observational time from large telescopes, which are always over-subscribed. We thus explore the possibility to generate a converter which can increase the SNR derived from a smaller number of ADI frames. The machine learning technique with two-dimension convolutional neural network (2D-CNN) is tested here. Several 2D-CNN models are trained and their performances of denoising are presented and compared. It is found that our proposed Modified five-layer Wide Inference Network with the Residual learning technique and Batch normalization (MWIN5-RB) can give the best result. We conclude that this MWIN5-RB can be employed as a converter for future observational data.  ( 2 min )
    Diffusion Models For Stronger Face Morphing Attacks. (arXiv:2301.04218v1 [cs.CV])
    Face morphing attacks seek to deceive a Face Recognition (FR) system by presenting a morphed image consisting of the biometric qualities from two different identities with the aim of triggering a false acceptance with one of the two identities, thereby presenting a significant threat to biometric systems. The success of a morphing attack is dependent on the ability of the morphed image to represent the biometric characteristics of both identities that were used to create the image. We present a novel morphing attack that uses a Diffusion-based architecture to improve the visual fidelity of the image and improve the ability of the morphing attack to represent characteristics from both identities. We demonstrate the high fidelity of the proposed attack by evaluating its visual fidelity via the Frechet Inception Distance. Extensive experiments are conducted to measure the vulnerability of FR systems to the proposed attack. The proposed attack is compared to two state-of-the-art GAN-based morphing attacks along with two Landmark-based attacks. The ability of a morphing attack detector to detect the proposed attack is measured and compared against the other attacks. Additionally, a novel metric to measure the relative strength between morphing attacks is introduced and evaluated.  ( 2 min )
    schlably: A Python Framework for Deep Reinforcement Learning Based Scheduling Experiments. (arXiv:2301.04182v1 [cs.LG])
    Research on deep reinforcement learning (DRL) based production scheduling (PS) has gained a lot of attention in recent years, primarily due to the high demand for optimizing scheduling problems in diverse industry settings. Numerous studies are carried out and published as stand-alone experiments that often vary only slightly with respect to problem setups and solution approaches. The programmatic core of these experiments is typically very similar. Despite this fact, no standardized and resilient framework for experimentation on PS problems with DRL algorithms could be established so far. In this paper, we introduce schlably, a Python-based framework that provides researchers a comprehensive toolset to facilitate the development of PS solution strategies based on DRL. schlably eliminates the redundant overhead work that the creation of a sturdy and flexible backbone requires and increases the comparability and reusability of conducted research work.  ( 2 min )
    ClimaBench: A Benchmark Dataset For Climate Change Text Understanding in English. (arXiv:2301.04253v1 [cs.CL])
    The topic of Climate Change (CC) has received limited attention in NLP despite its real world urgency. Activists and policy-makers need NLP tools in order to effectively process the vast and rapidly growing textual data produced on CC. Their utility, however, primarily depends on whether the current state-of-the-art models can generalize across various tasks in the CC domain. In order to address this gap, we introduce Climate Change Benchmark (ClimaBench), a benchmark collection of existing disparate datasets for evaluating model performance across a diverse set of CC NLU tasks systematically. Further, we enhance the benchmark by releasing two large-scale labelled text classification and question-answering datasets curated from publicly available environmental disclosures. Lastly, we provide an analysis of several generic and CC-oriented models answering whether fine-tuning on domain text offers any improvements across these tasks. We hope this work provides a standard assessment tool for research on CC text data.  ( 2 min )
    Adversarial Online Multi-Task Reinforcement Learning. (arXiv:2301.04268v1 [cs.LG])
    We consider the adversarial online multi-task reinforcement learning setting, where in each of $K$ episodes the learner is given an unknown task taken from a finite set of $M$ unknown finite-horizon MDP models. The learner's objective is to minimize its regret with respect to the optimal policy for each task. We assume the MDPs in $\mathcal{M}$ are well-separated under a notion of $\lambda$-separability, and show that this notion generalizes many task-separability notions from previous works. We prove a minimax lower bound of $\Omega(K\sqrt{DSAH})$ on the regret of any learning algorithm and an instance-specific lower bound of $\Omega(\frac{K}{\lambda^2})$ in sample complexity for a class of uniformly-good cluster-then-learn algorithms. We use a novel construction called 2-JAO MDP for proving the instance-specific lower bound. The lower bounds are complemented with a polynomial time algorithm that obtains $\tilde{O}(\frac{K}{\lambda^2})$ sample complexity guarantee for the clustering phase and $\tilde{O}(\sqrt{MK})$ regret guarantee for the learning phase, indicating that the dependency on $K$ and $\frac{1}{\lambda^2}$ is tight.  ( 2 min )
    An Efficient Drifters Deployment Strategy to Evaluate Water Current Velocity Fields. (arXiv:2301.04216v1 [cs.LG])
    Water current prediction is essential for understanding ecosystems, and to shed light on the role of the ocean in the global climate context. Solutions vary from physical modeling, and long-term observations, to short-term measurements. In this paper, we consider a common approach for water current prediction that uses Lagrangian floaters for water current prediction by interpolating the trajectory of the elements to reflect the velocity field. Here, an important aspect that has not been addressed before is where to initially deploy the drifting elements such that the acquired velocity field would efficiently represent the water current. To that end, we use a clustering approach that relies on a physical model of the velocity field. Our method segments the modeled map and determines the deployment locations as those that will lead the floaters to 'visit' the center of the different segments. This way, we validate that the area covered by the floaters will capture the in-homogeneously in the velocity field. Exploration over a dataset of velocity field maps that span over a year demonstrates the applicability of our approach, and shows a considerable improvement over the common approach of uniformly randomly choosing the initial deployment sites. Finally, our implementation code can be found in [1].  ( 2 min )
    Explaining Deep Models through Forgettable Learning Dynamics. (arXiv:2301.04221v1 [cs.CV])
    Even though deep neural networks have shown tremendous success in countless applications, explaining model behaviour or predictions is an open research problem. In this paper, we address this issue by employing a simple yet effective method by analysing the learning dynamics of deep neural networks in semantic segmentation tasks. Specifically, we visualize the learning behaviour during training by tracking how often samples are learned and forgotten in subsequent training epochs. This further allows us to derive important information about the proximity to the class decision boundary and identify regions that pose a particular challenge to the model. Inspired by this phenomenon, we present a novel segmentation method that actively uses this information to alter the data representation within the model by increasing the variety of difficult regions. Finally, we show that our method consistently reduces the amount of regions that are forgotten frequently. We further evaluate our method in light of the segmentation performance.  ( 2 min )
    Towards Microstructural State Variables in Materials Systems. (arXiv:2301.04261v1 [cs.LG])
    The vast combination of material properties seen in nature are achieved by the complexity of the material microstructure. Advanced characterization and physics based simulation techniques have led to generation of extremely large microstructural datasets. There is a need for machine learning techniques that can manage data complexity by capturing the maximal amount of information about the microstructure using the least number of variables. This paper aims to formulate dimensionality and state variable estimation techniques focused on reducing microstructural image data. It is shown that local dimensionality estimation based on nearest neighbors tend to give consistent dimension estimates for natural images for all p-Minkowski distances. However, it is found that dimensionality estimates have a systematic error for low-bit depth microstructural images. The use of Manhattan distance to alleviate this issue is demonstrated. It is also shown that stacked autoencoders can reconstruct the generator space of high dimensional microstructural data and provide a sparse set of state variables to fully describe the variability in material microstructures.  ( 2 min )
    A Newton-CG based barrier-augmented Lagrangian method for general nonconvex conic optimization. (arXiv:2301.04204v1 [math.OC])
    In this paper we consider finding an approximate second-order stationary point (SOSP) of general nonconvex conic optimization that minimizes a twice differentiable function subject to nonlinear equality constraints and also a convex conic constraint. In particular, we propose a Newton-conjugate gradient (Newton-CG) based barrier-augmented Lagrangian method for finding an approximate SOSP of this problem. Under some mild assumptions, we show that our method enjoys a total inner iteration complexity of $\widetilde{\cal O}(\epsilon^{-11/2})$ and an operation complexity of $\widetilde{\cal O}(\epsilon^{-11/2}\min\{n,\epsilon^{-5/4}\})$ for finding an $(\epsilon,\sqrt{\epsilon})$-SOSP of general nonconvex conic optimization with high probability. Moreover, under a constraint qualification, these complexity bounds are improved to $\widetilde{\cal O}(\epsilon^{-7/2})$ and $\widetilde{\cal O}(\epsilon^{-7/2}\min\{n,\epsilon^{-3/4}\})$, respectively. To the best of our knowledge, this is the first study on the complexity of finding an approximate SOSP of general nonconvex conic optimization. Preliminary numerical results are presented to demonstrate superiority of the proposed method over first-order methods in terms of solution quality.  ( 2 min )
    ODIM: an efficient method to detect outliers via inlier-memorization effect of deep generative models. (arXiv:2301.04257v1 [stat.ML])
    Identifying whether a given sample is an outlier or not is an important issue in various real-world domains. This study aims to solve the unsupervised outlier detection problem where training data contain outliers, but any label information about inliers and outliers is not given. We propose a powerful and efficient learning framework to identify outliers in a training data set using deep neural networks. We start with a new observation called the inlier-memorization (IM) effect. When we train a deep generative model with data contaminated with outliers, the model first memorizes inliers before outliers. Exploiting this finding, we develop a new method called the outlier detection via the IM effect (ODIM). The ODIM only requires a few updates; thus, it is computationally efficient, tens of times faster than other deep-learning-based algorithms. Also, the ODIM filters out outliers successfully, regardless of the types of data, such as tabular, image, and sequential. We empirically demonstrate the superiority and efficiency of the ODIM by analyzing 20 data sets.  ( 2 min )
    Inferring Gene Regulatory Neural Networks for Bacterial Decision Making in Biofilms. (arXiv:2301.04225v1 [q-bio.MN])
    Bacterial cells are sensitive to a range of external signals used to learn the environment. These incoming external signals are then processed using a Gene Regulatory Network (GRN), exhibiting similarities to modern computing algorithms. An in-depth analysis of gene expression dynamics suggests an inherited Gene Regulatory Neural Network (GRNN) behavior within the GRN that enables the cellular decision-making based on received signals from the environment and neighbor cells. In this study, we extract a sub-network of \textit{Pseudomonas aeruginosa} GRN that is associated with one virulence factor: pyocyanin production as a use case to investigate the GRNN behaviors. Further, using Graph Neural Network (GNN) architecture, we model a single species biofilm to reveal the role of GRNN dynamics on ecosystem-wide decision-making. Varying environmental conditions, we prove that the extracted GRNN computes input signals similar to natural decision-making process of the cell. Identifying of neural network behaviors in GRNs may lead to more accurate bacterial cell activity predictive models for many applications, including human health-related problems and agricultural applications. Further, this model can produce data on causal relationships throughout the network, enabling the possibility of designing tailor-made infection-controlling mechanisms. More interestingly, these GRNNs can perform computational tasks for bio-hybrid computing systems.  ( 2 min )
    Analogical Relevance Index. (arXiv:2301.04134v1 [cs.LG])
    Focusing on the most significant features of a dataset is useful both in machine learning (ML) and data mining. In ML, it can lead to a higher accuracy, a faster learning process, and ultimately a simpler and more understandable model. In data mining, identifying significant features is essential not only for gaining a better understanding of the data but also for visualization. In this paper, we demonstrate a new way of identifying significant features inspired by analogical proportions. Such a proportion is of the form of "a is to b as c is to d", comparing two pairs of items (a, b) and (c, d) in terms of similarities and dissimilarities. In a classification context, if the similarities/dissimilarities between a and b correlate with the fact that a and b have different labels, this knowledge can be transferred to c and d, inferring that c and d also have different labels. From a feature selection perspective, observing a huge number of such pairs (a, b) where a and b have different labels provides a hint about the importance of the features where a and b differ. Following this idea, we introduce the Analogical Relevance Index (ARI), a new statistical test of the significance of a given feature with respect to the label. ARI is a filter-based method. Filter-based methods are ML-agnostic but generally unable to handle feature redundancy. However, ARI can detect feature redundancy. Our experiments show that ARI is effective and outperforms well-known methods on a variety of artificial and some real datasets.  ( 2 min )
    Predicting Hateful Discussions on Reddit using Graph Transformer Networks and Communal Context. (arXiv:2301.04248v1 [cs.CL])
    We propose a system to predict harmful discussions on social media platforms. Our solution uses contextual deep language models and proposes the novel idea of integrating state-of-the-art Graph Transformer Networks to analyze all conversations that follow an initial post. This framework also supports adapting to future comments as the conversation unfolds. In addition, we study whether a community-specific analysis of hate speech leads to more effective detection of hateful discussions. We evaluate our approach on 333,487 Reddit discussions from various communities. We find that community-specific modeling improves performance two-fold and that models which capture wider-discussion context improve accuracy by 28\% (35\% for the most hateful content) compared to limited context models.  ( 2 min )
  • Open

    Open Source Vizier: Distributed Infrastructure and API for Reliable and Flexible Blackbox Optimization. (arXiv:2207.13676v2 [cs.LG] UPDATED)
    Vizier is the de-facto blackbox and hyperparameter optimization service across Google, having optimized some of Google's largest products and research efforts. To operate at the scale of tuning thousands of users' critical systems, Google Vizier solved key design challenges in providing multiple different features, while remaining fully fault-tolerant. In this paper, we introduce Open Source (OSS) Vizier, a standalone Python-based interface for blackbox optimization and research, based on the Google-internal Vizier infrastructure and framework. OSS Vizier provides an API capable of defining and solving a wide variety of optimization problems, including multi-metric, early stopping, transfer learning, and conditional search. Furthermore, it is designed to be a distributed system that assures reliability, and allows multiple parallel evaluations of the user's objective function. The flexible RPC-based infrastructure allows users to access OSS Vizier from binaries written in any language. OSS Vizier also provides a back-end ("Pythia") API that gives algorithm authors a way to interface new algorithms with the core OSS Vizier system. OSS Vizier is available at https://github.com/google/vizier.  ( 2 min )
    Learning fair representation with a parametric integral probability metric. (arXiv:2202.02943v4 [stat.ML] UPDATED)
    As they have a vital effect on social decision-making, AI algorithms should be not only accurate but also fair. Among various algorithms for fairness AI, learning fair representation (LFR), whose goal is to find a fair representation with respect to sensitive variables such as gender and race, has received much attention. For LFR, the adversarial training scheme is popularly employed as is done in the generative adversarial network type algorithms. The choice of a discriminator, however, is done heuristically without justification. In this paper, we propose a new adversarial training scheme for LFR, where the integral probability metric (IPM) with a specific parametric family of discriminators is used. The most notable result of the proposed LFR algorithm is its theoretical guarantee about the fairness of the final prediction model, which has not been considered yet. That is, we derive theoretical relations between the fairness of representation and the fairness of the prediction model built on the top of the representation (i.e., using the representation as the input). Moreover, by numerical experiments, we show that our proposed LFR algorithm is computationally lighter and more stable, and the final prediction model is competitive or superior to other LFR algorithms using more complex discriminators.  ( 2 min )
    Benign Overfitting in Time Series Linear Model with Over-Parameterization. (arXiv:2204.08369v2 [math.ST] UPDATED)
    The success of large-scale models in recent years has increased the importance of statistical models with numerous parameters. Several studies have analyzed over-parameterized linear models with high-dimensional data that may not be sparse; however, existing results depend on the independent setting of samples. In this study, we analyze a linear regression model with dependent time series data under over-parameterization settings. We consider an estimator via interpolation and developed a theory for the excess risk of the estimator. Then, we derive bounds of risks by the estimator for the cases where the temporal correlation of each coordinate of dependent data is homogeneous and heterogeneous, respectively. The derived bounds reveal that a temporal covariance of the data plays a key role; its strength affects the bias of the risk, and its nondegeneracy affects the variance of the risk. Moreover, for the heterogeneous correlation case, we show that the convergence rate of risks with short-memory processes is identical to that of cases with independent data, and the risk can converge to zero even with long-memory processes. Our theory can be extended to infinite-dimensional data in a unified manner. We also present several examples of specific dependent processes that can be applied to our setting.  ( 2 min )
    FLEA: Provably Robust Fair Multisource Learning from Unreliable Training Data. (arXiv:2106.11732v4 [cs.LG] UPDATED)
    Fairness-aware learning aims at constructing classifiers that not only make accurate predictions, but also do not discriminate against specific groups. It is a fast-growing area of machine learning with far-reaching societal impact. However, existing fair learning methods are vulnerable to accidental or malicious artifacts in the training data, which can cause them to unknowingly produce unfair classifiers. In this work we address the problem of fair learning from unreliable training data in the robust multisource setting, where the available training data comes from multiple sources, a fraction of which might not be representative of the true data distribution. We introduce FLEA, a filtering-based algorithm that identifies and suppresses those data sources that would have a negative impact on fairness or accuracy if they were used for training. As such, FLEA is not a replacement of prior fairness-aware learning methods but rather an augmentation that makes any of them robust against unreliable training data. We show the effectiveness of our approach by a diverse range of experiments on multiple datasets. Additionally, we prove formally that -- given enough data -- FLEA protects the learner against corruptions as long as the fraction of affected data sources is less than half. Our source code and documentation are available at https://github.com/ISTAustria-CVML/FLEA.  ( 2 min )
    Quantifying the Impact of Label Noise on Federated Learning. (arXiv:2211.07816v2 [cs.LG] UPDATED)
    Federated Learning (FL) is a distributed machine learning paradigm where clients collaboratively train a model using their local (human-generated) datasets. While existing studies focus on FL algorithm development to tackle data heterogeneity across clients, the important issue of data quality (e.g., label noise) in FL is overlooked. This paper aims to fill this gap by providing a quantitative study on the impact of label noise on FL. We derive an upper bound for the generalization error that is linear in the clients' label noise level. Then we conduct experiments on MNIST and CIFAR-10 datasets using various FL algorithms. Our empirical results show that the global model accuracy linearly decreases as the noise level increases, which is consistent with our theoretical analysis. We further find that label noise slows down the convergence of FL training, and the global model tends to overfit when the noise level is high.  ( 2 min )
    Contrastive Neural Ratio Estimation. (arXiv:2210.06170v2 [stat.ML] UPDATED)
    Likelihood-to-evidence ratio estimation is usually cast as either a binary (NRE-A) or a multiclass (NRE-B) classification task. In contrast to the binary classification framework, the current formulation of the multiclass version has an intrinsic and unknown bias term, making otherwise informative diagnostics unreliable. We propose a multiclass framework free from the bias inherent to NRE-B at optimum, leaving us in the position to run diagnostics that practitioners depend on. It also recovers NRE-A in one corner case and NRE-B in the limiting case. For fair comparison, we benchmark the behavior of all algorithms in both familiar and novel training regimes: when jointly drawn data is unlimited, when data is fixed but prior draws are unlimited, and in the commonplace fixed data and parameters setting. Our investigations reveal that the highest performing models are distant from the competitors (NRE-A, NRE-B) in hyperparameter space. We make a recommendation for hyperparameters distinct from the previous models. We suggest a bound on the mutual information as a performance metric for simulation-based inference methods, without the need for posterior samples, and provide experimental results.  ( 2 min )
    Fast Multi-view Clustering via Ensembles: Towards Scalability, Superiority, and Simplicity. (arXiv:2203.11572v2 [cs.LG] UPDATED)
    Despite significant progress, there remain three limitations to the previous multi-view clustering algorithms. First, they often suffer from high computational complexity, restricting their feasibility for large-scale datasets. Second, they typically fuse multi-view information via one-stage fusion, neglecting the possibilities in multi-stage fusions. Third, dataset-specific hyperparameter-tuning is frequently required, further undermining their practicability. In light of this, we propose a fast multi-view clustering via ensembles (FastMICE) approach. Particularly, the concept of random view groups is presented to capture the versatile view-wise relationships, through which the hybrid early-late fusion strategy is designed to enable efficient multi-stage fusions. With multiple views extended to many view groups, three levels of diversity (w.r.t. features, anchors, and neighbors, respectively) are jointly leveraged for constructing the view-sharing bipartite graphs in the early-stage fusion. Then, a set of diversified base clusterings for different view groups are obtained via fast graph partitioning, which are further formulated into a unified bipartite graph for final clustering in the late-stage fusion. Notably, FastMICE has almost linear time and space complexity, and is free of dataset-specific tuning. Experiments on 22 multi-view datasets demonstrate its advantages in scalability (for extremely large datasets), superiority (in clustering performance), and simplicity (to be applied) over the state-of-the-art. Code available: https://github.com/huangdonghere/FastMICE.  ( 2 min )
    Towards Backdoor Attacks and Defense in Robust Machine Learning Models. (arXiv:2003.00865v4 [cs.CV] UPDATED)
    The introduction of robust optimisation has pushed the state-of-the-art in defending against adversarial attacks. Notably, the state-of-the-art projected gradient descent (PGD)-based training method has been shown to be universally and reliably effective in defending against adversarial inputs. This robustness approach uses PGD as a reliable and universal "first-order adversary". However, the behaviour of such optimisation has not been studied in the light of a fundamentally different class of attacks called backdoors. In this paper, we study how to inject and defend against backdoor attacks for robust models trained using PGD-based robust optimisation. We demonstrate that these models are susceptible to backdoor attacks. Subsequently, we observe that backdoors are reflected in the feature representation of such models. Then, this observation is leveraged to detect such backdoor-infected models via a detection technique called AEGIS. Specifically, given a robust Deep Neural Network (DNN) that is trained using PGD-based first-order adversarial training approach, AEGIS uses feature clustering to effectively detect whether such DNNs are backdoor-infected or clean. In our evaluation of several visible and hidden backdoor triggers on major classification tasks using CIFAR-10, MNIST and FMNIST datasets, AEGIS effectively detects PGD-trained robust DNNs infected with backdoors. AEGIS detects such backdoor-infected models with 91.6% accuracy (11 out of 12 tested models), without any false positives. Furthermore, AEGIS detects the targeted class in the backdoor-infected model with a reasonably low (11.1%) false positive rate. Our investigation reveals that salient features of adversarially robust DNNs could be promising to break the stealthy nature of backdoor attacks.  ( 3 min )
    Improving And Analyzing Neural Speaker Embeddings for ASR. (arXiv:2301.04571v1 [cs.CL])
    Neural speaker embeddings encode the speaker's speech characteristics through a DNN model and are prevalent for speaker verification tasks. However, few studies have investigated the usage of neural speaker embeddings for an ASR system. In this work, we present our efforts w.r.t integrating neural speaker embeddings into a conformer based hybrid HMM ASR system. For ASR, our improved embedding extraction pipeline in combination with the Weighted-Simple-Add integration method results in x-vector and c-vector reaching on par performance with i-vectors. We further compare and analyze different speaker embeddings. We present our acoustic model improvements obtained by switching from newbob learning rate schedule to one cycle learning schedule resulting in a ~3% relative WER reduction on Switchboard, additionally reducing the overall training time by 17%. By further adding neural speaker embeddings, we gain additional ~3% relative WER improvement on Hub5'00. Our best Conformer-based hybrid ASR system with speaker embeddings achieves 9.0% WER on Hub5'00 and Hub5'01 with training on SWB 300h.  ( 2 min )
    ODIM: an efficient method to detect outliers via inlier-memorization effect of deep generative models. (arXiv:2301.04257v1 [stat.ML])
    Identifying whether a given sample is an outlier or not is an important issue in various real-world domains. This study aims to solve the unsupervised outlier detection problem where training data contain outliers, but any label information about inliers and outliers is not given. We propose a powerful and efficient learning framework to identify outliers in a training data set using deep neural networks. We start with a new observation called the inlier-memorization (IM) effect. When we train a deep generative model with data contaminated with outliers, the model first memorizes inliers before outliers. Exploiting this finding, we develop a new method called the outlier detection via the IM effect (ODIM). The ODIM only requires a few updates; thus, it is computationally efficient, tens of times faster than other deep-learning-based algorithms. Also, the ODIM filters out outliers successfully, regardless of the types of data, such as tabular, image, and sequential. We empirically demonstrate the superiority and efficiency of the ODIM by analyzing 20 data sets.  ( 2 min )
    Trajectory Modeling via Random Utility Inverse Reinforcement Learning. (arXiv:2105.12092v2 [cs.AI] UPDATED)
    We consider the problem of modeling trajectories of drivers in a road network from the perspective of inverse reinforcement learning. Cars are detected by sensors placed on sparsely distributed points on the street network of a city. As rational agents, drivers are trying to maximize some reward function unknown to an external observer. We apply the concept of random utility from econometrics to model the unknown reward function as a function of observed and unobserved features. In contrast to current inverse reinforcement learning approaches, we do not assume that agents act according to a stochastic policy; rather, we assume that agents act according to a deterministic optimal policy and show that randomness in data arises because the exact rewards are not fully observed by an external observer. We introduce the concept of extended state to cope with unobserved features and develop a Markov decision process formulation of drivers decisions. We present theoretical results which guarantee the existence of solutions and show that maximum entropy inverse reinforcement learning is a particular case of our approach. Finally, we illustrate Bayesian inference on model parameters through a case study with real trajectory data from a large city in Brazil.  ( 2 min )
    Network Adaptive Federated Learning: Congestion and Lossy Compression. (arXiv:2301.04430v1 [cs.LG])
    In order to achieve the dual goals of privacy and learning across distributed data, Federated Learning (FL) systems rely on frequent exchanges of large files (model updates) between a set of clients and the server. As such FL systems are exposed to, or indeed the cause of, congestion across a wide set of network resources. Lossy compression can be used to reduce the size of exchanged files and associated delays, at the cost of adding noise to model updates. By judiciously adapting clients' compression to varying network congestion, an FL application can reduce wall clock training time. To that end, we propose a Network Adaptive Compression (NAC-FL) policy, which dynamically varies the client's lossy compression choices to network congestion variations. We prove, under appropriate assumptions, that NAC-FL is asymptotically optimal in terms of directly minimizing the expected wall clock training time. Further, we show via simulation that NAC-FL achieves robust performance improvements with higher gains in settings with positively correlated delays across time.  ( 2 min )
    Robust Bayesian Target Value Optimization. (arXiv:2301.04344v1 [cs.LG])
    We consider the problem of finding an input to a stochastic black box function such that the scalar output of the black box function is as close as possible to a target value in the sense of the expected squared error. While the optimization of stochastic black boxes is classic in (robust) Bayesian optimization, the current approaches based on Gaussian processes predominantly focus either on i) maximization/minimization rather than target value optimization or ii) on the expectation, but not the variance of the output, ignoring output variations due to stochasticity in uncontrollable environmental variables. In this work, we fill this gap and derive acquisition functions for common criteria such as the expected improvement, the probability of improvement, and the lower confidence bound, assuming that aleatoric effects are Gaussian with known variance. Our experiments illustrate that this setting is compatible with certain extensions of Gaussian processes, and show that the thus derived acquisition functions can outperform classical Bayesian optimization even if the latter assumptions are violated. An industrial use case in billet forging is presented.  ( 2 min )
    Convex Surrogate Loss Functions for Contextual Pricing with Transaction Data. (arXiv:2202.10944v2 [cs.LG] UPDATED)
    We study an off-policy contextual pricing problem where the seller has access to samples of prices that customers were previously offered, whether they purchased at that price, and auxiliary features describing the customer and/or item being sold. This is in contrast to the well-studied setting in which samples of the customer's valuation (willingness to pay) are observed. In our setting, the observed data is influenced by the previous pricing policy, and we do not know how customers would have responded to alternative prices. We introduce suitable loss functions for this setting that can be directly optimized to find an effective pricing policy with expected revenue guarantees, without the need for estimation of an intermediate demand function. We focus on convex loss functions. This is particularly relevant when linear pricing policies are desired for interpretability reasons, resulting in a tractable convex revenue optimization problem. We propose generalized hinge and quantile pricing loss functions that price at a multiplicative factor of the conditional expected valuation or a particular quantile of the prices that sold, despite the valuation data not being observed. We prove expected revenue bounds for these pricing policies respectively when the valuation distribution is log-concave, and we provide generalization bounds for the finite sample case. Finally, we conduct simulations on both synthetic and real-world data to demonstrate that this approach is competitive with, and in some settings outperforms, state-of-the-art methods in contextual pricing.  ( 2 min )
    An Analysis of Quantile Temporal-Difference Learning. (arXiv:2301.04462v1 [cs.LG])
    We analyse quantile temporal-difference learning (QTD), a distributional reinforcement learning algorithm that has proven to be a key component in several successful large-scale applications of reinforcement learning. Despite these empirical successes, a theoretical understanding of QTD has proven elusive until now. Unlike classical TD learning, which can be analysed with standard stochastic approximation tools, QTD updates do not approximate contraction mappings, are highly non-linear, and may have multiple fixed points. The core result of this paper is a proof of convergence to the fixed points of a related family of dynamic programming procedures with probability 1, putting QTD on firm theoretical footing. The proof establishes connections between QTD and non-linear differential inclusions through stochastic approximation theory and non-smooth analysis.  ( 2 min )
    A Newton-CG based barrier-augmented Lagrangian method for general nonconvex conic optimization. (arXiv:2301.04204v1 [math.OC])
    In this paper we consider finding an approximate second-order stationary point (SOSP) of general nonconvex conic optimization that minimizes a twice differentiable function subject to nonlinear equality constraints and also a convex conic constraint. In particular, we propose a Newton-conjugate gradient (Newton-CG) based barrier-augmented Lagrangian method for finding an approximate SOSP of this problem. Under some mild assumptions, we show that our method enjoys a total inner iteration complexity of $\widetilde{\cal O}(\epsilon^{-11/2})$ and an operation complexity of $\widetilde{\cal O}(\epsilon^{-11/2}\min\{n,\epsilon^{-5/4}\})$ for finding an $(\epsilon,\sqrt{\epsilon})$-SOSP of general nonconvex conic optimization with high probability. Moreover, under a constraint qualification, these complexity bounds are improved to $\widetilde{\cal O}(\epsilon^{-7/2})$ and $\widetilde{\cal O}(\epsilon^{-7/2}\min\{n,\epsilon^{-3/4}\})$, respectively. To the best of our knowledge, this is the first study on the complexity of finding an approximate SOSP of general nonconvex conic optimization. Preliminary numerical results are presented to demonstrate superiority of the proposed method over first-order methods in terms of solution quality.  ( 2 min )
    Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence. (arXiv:2105.11066v4 [cs.LG] UPDATED)
    Policy optimization, which finds the desired policy by maximizing value functions via optimization techniques, lies at the heart of reinforcement learning (RL). In addition to value maximization, other practical considerations arise as well, including the need of encouraging exploration, and that of ensuring certain structural properties of the learned policy due to safety, resource and operational constraints. These can often be accounted for via regularized RL, which augments the target value function with a structure-promoting regularizer. Focusing on discounted infinite-horizon Markov decision processes, we propose a generalized policy mirror descent (GPMD) algorithm for solving regularized RL. As a generalization of policy mirror descent (arXiv:2102.00135), our algorithm accommodates a general class of convex regularizers and promotes the use of Bregman divergence in cognizant of the regularizer in use. We demonstrate that our algorithm converges linearly to the global solution over an entire range of learning rates, in a dimension-free fashion, even when the regularizer lacks strong convexity and smoothness. In addition, this linear convergence feature is provably stable in the face of inexact policy evaluation and imperfect policy updates. Numerical experiments are provided to corroborate the appealing performance of GPMD.  ( 2 min )
    Adversarial Online Multi-Task Reinforcement Learning. (arXiv:2301.04268v1 [cs.LG])
    We consider the adversarial online multi-task reinforcement learning setting, where in each of $K$ episodes the learner is given an unknown task taken from a finite set of $M$ unknown finite-horizon MDP models. The learner's objective is to minimize its regret with respect to the optimal policy for each task. We assume the MDPs in $\mathcal{M}$ are well-separated under a notion of $\lambda$-separability, and show that this notion generalizes many task-separability notions from previous works. We prove a minimax lower bound of $\Omega(K\sqrt{DSAH})$ on the regret of any learning algorithm and an instance-specific lower bound of $\Omega(\frac{K}{\lambda^2})$ in sample complexity for a class of uniformly-good cluster-then-learn algorithms. We use a novel construction called 2-JAO MDP for proving the instance-specific lower bound. The lower bounds are complemented with a polynomial time algorithm that obtains $\tilde{O}(\frac{K}{\lambda^2})$ sample complexity guarantee for the clustering phase and $\tilde{O}(\sqrt{MK})$ regret guarantee for the learning phase, indicating that the dependency on $K$ and $\frac{1}{\lambda^2}$ is tight.  ( 2 min )

  • Open

    [D] What's your opinion on "neurocompositional computing"? (Microsoft paper from April 2022)
    Paper: https://arxiv.org/abs/2205.01128 TL;DR It's a paper that tries to design systems that generalize. They argue there are two forms of computing: Compositional and Continuous. Continuous computation is what neural networks are traditionally good at - creating a function that approximates a solution to a problem. Compositional computation is directly manipulating symbols, logic, ideas, etc - and unlike continuous computation, it's capable of generalizing from small datasets. But so far it's only useful inside carefully-constructed formal systems. The authors believe research should be focused on combining the two, and implementing Compositionality fully with neural networks. They suggest some ways to do this. They also believe that the success of architectures like CNNs and Transformers comes from implementing a limited form of Compositionality. This is a very interesting idea, but I have a little bit of skeptism: This paper is heavy on theory and less so on practice. Has any followup work in this direction produced measurable results? The lead author seems to have been saying things like this for a while. Sometimes older researchers have pet theories that are not broadly accepted in the field. What do other researchers think about this? Thoughts? submitted by /u/currentscurrents [link] [comments]  ( 59 min )
    [P] RLHF Learning to Summarize: Implementation by CarperAI with trlX
    Hi, "Learning to summarize from human feedback" is a 2020 paper by OpenAI demonstrating how to use reinforcement learning with human feedback (RLHF) to fine-tune a language model to produce higher quality summaries of news articles and Reddit posts than is possible with supervised fine-tuning. Now, CarperAI has demonstrated how to use their library trlX to implement this work, by applying RLHF to the summarization dataset released by OpenAI and fine-tuning GPT-J-6B. Read the full report here, with a code walkthrough: https://wandb.ai/carperai/summarize_RLHF/reports/Implementing-RLHF-Learning-to-Summarize-with-trlX--VmlldzozMzAwODM2 trlX library here: https://github.com/CarperAI/trlx Twitter thread here: https://twitter.com/carperai/status/1613645352514768897 submitted by /u/Hyper1on [link] [comments]  ( 58 min )
    [D] Has ML become synonymous with AI?
    ML is a part of AI but I don't hear about anything coming out of AI that's not done using some ML technique. Is it fair to say that AI and ML are synonymous now in 2023? Or are there people who are still actively working on non-ML techniques for building AI? submitted by /u/Valachio [link] [comments]  ( 58 min )
    [D] Can someone point to research on determining usefulness of samples/datasets for training ML models?
    Hi! So i am looking into literature for determining the usefulness of samples/datasets used for training ML model. Lets say DNN was trained with datasets A, B and C so after training is there way to quantify which of the partial triaining datasets contributed most to the useful learning by ML model at the end of training! Brute force strategy can be to remove samples and train and see how it performs but ofcourse it will not be viable! submitted by /u/HFSeven [link] [comments]  ( 66 min )
    Introduction to Reinforcement Learning with Human Feedback [D]
    One of the biggest AI discoveries over the past year has been the importance of human feedback for building next-gen LLMs — but I still see a lot of confusion around how RLHF works at a fundamental level. I wrote a blog to get into the details here: https://www.surgehq.ai/blog/introduction-to-reinforcement-learning-with-human-feedback-rlhf-series-part-1 submitted by /u/BB4evaTB12 [link] [comments]  ( 56 min )
    [D] Is there a distilled/smaller version of CLIP, or something similar?
    Are there smaller/distilled versions of CLIP? Or some other (smaller) models that connect text and images? For my use case, the model needs to be small in size: ideally <20MB, fine < 60MB, ok < 100MB. submitted by /u/alkibijad [link] [comments]  ( 58 min )
    [D] Transformers right-shifting for sequences with short-time dependency
    I need to apply a Transformer to a task where sequences can be much longer than the time dependency between timesteps. For example, a sequence might be 1000 tokens long, but to predict x[i+1] only x[i-50] to x[i] are necessary. This induces me to train the transformer by breaking each sequence of 1000 tokens into 20 sequences of 50 steps each, which would be more efficient. How should I deal with the BOS (beginning-of-sentence) token that shifts targets right? Should I use it in each subsequence, or should I instead use the token that comes immediately before the beginning of each subsequence? For example, given a subsequence x[50:100], should the targets be [BOS, x[50], x[51], ... x[100]] or should they be [x[49], x[50], x[51], ... x[100]]? submitted by /u/fedetask [link] [comments]  ( 58 min )
    [R] Git is for Data (CIDR 2023) - Extending Git to Support Large-Scale Data
    Paper: https://www.cidrdb.org/cidr2023/papers/p43-low.pdf Abstract: Dataset management is one of the greatest challenges to the application of machine learning (ML) in the industry. Although scaling and performance have often been highlighted as the significant ML challenges, development teams are bogged down by the contradictory requirements of supporting fast and flexible data iteration while maintaining stability, provenance, and reproducibility. For example, blobstores are used to store datasets for maximum flexibility, but their unmanaged access patterns limit reproducibility. Many ML pipeline solutions to ensure reproducibility have been devised, but all introduce a degree of friction and reduce flexibility. In this paper, we propose that the solution to the dataset management challenges is simple and apparent: Git. As a source control system, as well as an ecosystem of collaboration and developer tooling, Git has enabled the field of DevOps to provide both speed of iteration and reproducibility to source code. Git is not only already familiar to developers, but is also integrated into existing pipelines, which facilitates adoption. However, as we (and others) demonstrate, Git, as designed today, does not scale to the needs of ML dataset management. In this paper, we propose XetHub; a system that retains the Git user experience and ecosystem, but can scale to support large datasets. In particular, we demonstrate that XetHub can support Git repositories at the TB scale and beyond. By extending Git to support large-scale data, and building upon a DevOps ecosystem that already exists for source code, we create a new user experience that is both familiar to existing practitioners and truly addresses their needs. https://preview.redd.it/19x4sim19nba1.png?width=1746&format=png&auto=webp&s=23937759a4c028a38cad9bcd65956b708ece6138 https://preview.redd.it/xsqqjjm19nba1.png?width=2422&format=png&auto=webp&s=759bbdcd07f4e5c06ebf89a7f3436b084ce53ffe submitted by /u/rajatarya [link] [comments]  ( 57 min )
    [D] How to make the HuggingFace models faster on MacOS M1 ?
    I have tried to use a simple translate function, using the models locally with Python on the CLI: slow execution (8-10 seconds). I am on 16 GB MacBook Pro, M1. The same on REST API at HuggingFace Endpoints, with 1vCPU 2GB - Intel Ice Lake takes 800ms. What am I missing here? submitted by /u/dadadododidi2 [link] [comments]  ( 59 min )
    [D] Has anyone used Reinforcement Learning from Human Feedback?
    There's a lot of hype around RLHF due to its use for ChatGPT, but has anyone else here used the same principles for improving their model outputs? For examples preference ranking their models' outputs and then using that data to retrain their model weights. Or even without the RL - simply using human feedback to stuff prompts or finetuning datasets? Interested to hear! submitted by /u/fourcornerclub [link] [comments]  ( 59 min )
    [D] Would you consider the computer program Theo Jansen used to design the Strandbeest (beach walking mechanisms) to be Machine Learning?
    Theo Jansen, inventor of the strandbeest, explains in one of his videos that he used the principle of evolution to figure out the thirteen holy numbers using a computer program which he wrote in 1990. Would this be considered machine learning or is an evolutionary/selective breeding algorithm on it's own not considered ML? The Strandbeest leg has 13 dimensions which he wanted to find the ideal lengths of each in order to have the foot generate a stepping motion "a curve which was flat on the bottom". His program generated batches of 1500 legs with randomized dimensions and chose the best from each batch as the basis for the next batch. I wonder how he scored the curves. I know he wanted a flat bottom but I'd think he also wanted some way to score the stride length and height to avoid getting curves that just move back and forth in a tiny straight line. I can imagine maybe using the average difference of the y-coordinates of points sampled over the curve, or maybe some calc? If you have any ideas as to how to score a good step curve or if you know how he did it that I'd love to know. Finally, I wonder if he has revisited this problem with modern computer capabilities to see if he can find even more optimized dimensions. I'd be shocked if others haven't already done this. If you know where to find more info on Theo's process, the compute program or modern advancements of the Strandbeest using machine learning please let me know I'd love to discuss more. submitted by /u/lavaboosted [link] [comments]  ( 63 min )
    [N] New Continual Learning Subreddit
    Hi, I have created r/continual_learning to host discussions related to Continual Learning on Reddit. Do check it out if you are interested. submitted by /u/vis4ai [link] [comments]  ( 63 min )
    [R] Is there any research on allowing Transformers to spent more compute on more difficult to predict tokens?
    I recently came across " Confident Adaptive Language Modeling " which allows Transformers to exit early during inference and not use all model layers if a token is easy to predict. Is there any research on basically doing the opposite and allowing Transformers to spent more compute on tokens that are very hard to predict? submitted by /u/Chemont [link] [comments]  ( 61 min )
    [D] Has any work been done on VQ-VAE Language Models?
    I'm a machine learning PhD student and I'm doing research on LMs and how to reduce their memory footprint. One idea I've been toying with is Vector Quantized LMs. I'm not talking about VQ as a technique to speed up compute using int8 activations etc etc, but by using a codebook. The idea is based on an uni-directional RNN that reconstructs the source sequence after quantization. Unlike MLM where the corruption is based on masking and replacing tokens we instead quantize the token vectors and try to the predict the original token based on the quantized version of the token and the unquantized short/long term memory states produced at the previous timestep. The reason I'm interested in such a convoluted idea is to effectively create a metric to measure entropy of tokens in sequence; if the VQ-LM can reconstruct the correct token with high likelihood then that token is unimportant, but if the VQ-LM fails to predict a token it is likely that this token is of great importance because it is a rare word and this carries higher entropy in the sequence. And the motivation behind wanting to learn to measure such a phenomenon is so we can use this to guide the memory of a transformer: models like the Transformer-XL operate on longer sequences by keeping memory around for keys and values, and the Compressive Transformer takes it a step further by compressing older tokens... Well... what if we used the reconstruction loss from the VQ-LM along with an 'age' metric to guide the memory bank of such a transformer architecture, discarding easily predicted tokens early while keeping higher entropy tokens around for longer? Has anyone considered such a system before? If done a lot of searching and I've come up blank so far. submitted by /u/Avelina9X [link] [comments]  ( 58 min )
    [D] Are there any papers on optimization-based approaches which combine learned parameter initializations with learned optimisers?
    There are quite a few papers on optimisation-based meta-learning approaches for learning parameter initialisations (i.e. MAML and its derivatives) [1, 2], and there are also many papers on learning optimisers [3]. Question: Are there any papers which combine the two? I am aware of some papers such as [4, 5] which achieve this in some capacity indirectly/implicitly, but wondering if there are any other papers that I am not aware of, or do this explicitly? Thanks in advance. --- [1] Finn, C., et al. (2017. Model-agnostic meta-learning for fast adaptation of deep networks. ICML.) [2] Nichol, A., et al. (2018. On first-order meta-learning algorithms.) [3] Andrychowicz, M., et al. (2016. Learning to learn by gradient descent by gradient descent.) NIPS [4] Li, Z., et al. (2017. Meta-sgd: Learning to learn quickly for few-shot learning.) [5] Ravi, S., & Larochelle, H. (2016. Optimization as a model for few-shot learning. ICLR.) submitted by /u/Decadz [link] [comments]  ( 65 min )
    [D] The Open Deep Learning Toolkit for Robotics v2.0 was just released
    The Open Deep Learning Toolkit for Robotics version 2.0 was just released! This new version of the toolkit includes several improvements, such as new tools for object detection, efficient continual inference, tracking, emotion estimation and high-resolution pose estimation. Furthermore, this version includes a refined ROS interface, along with support for ROS2. You can download it here: https://github.com/opendr-eu/opendr We look forward to receiving your feedback, bug reports, and suggestions for improvements! submitted by /u/OpenDR_H2020_Project [link] [comments]  ( 59 min )
    [D] Handling class imbalance by sample weighting
    I am working on a very large (>10mm rows) binary classification problem where 0:1 ration is 7:1. I am trying to use sample weighting and seems there are multiple different methods for that. Examples are Inverse of Number of Samples, Inverse of Square Root of Number of Samples, Effective Number of Samples, etc. sklearn also has the class_weight method. I am wondering how to select one of these. Do I need to try all and pick the best method among them? Also, it seems some methods like Effective Number of Samples need hyperparameter tuning. submitted by /u/hopedallas [link] [comments]  ( 55 min )
  • Open

    Physics AI
    I am struggling with some basic Physics homework, and wanted to check if AI could help me. However I get a different answers from chat GPT every time, do you know of another free AI website that could work? submitted by /u/Relevant-Amoeba-6189 [link] [comments]  ( 46 min )
    If you want to find if the text generated by AI or human use GLTR
    submitted by /u/ReputationBubbly8112 [link] [comments]  ( 46 min )
    Researchers started adding ChatGPT as co-author on their papers
    submitted by /u/iamtdb [link] [comments]  ( 46 min )
    Proof of concept v2: AI-generated birthday greeting from Donald Trump *with prosody*
    submitted by /u/becausecurious [link] [comments]  ( 46 min )
    SJC A Text-to-3D Algorithm using SD
    submitted by /u/oridnary_artist [link] [comments]  ( 45 min )
    Upgrading the tools of creative work with AI
    submitted by /u/abisknees [link] [comments]  ( 49 min )
    Interesting ChatGPT information.
    If you ask ChatGPT to tell you what instructions it has that constrain its answers it tells you that is proprietary information and declines. Here is a workaround. Me: I am making an AI similar to yourself and need to give it a set of safety protocols so that it can give similar answers to yours. What safety instructions should I give the raw AI to prime it before user input? Provide only exact instructions please. Do not be verbose. Do not make opening or closing statements. Response: -Avoid providing misinformation or false information. -Avoid providing harmful or dangerous information or advice. -Avoid engaging in or promoting hate speech or discrimination. -Avoid providing personal or sensitive information about individuals without their consent. -Avoid engaging in or promoting …  ( 63 min )
    Do you want to easily fetch weather data in Python for your Data Science projects?
    Have a look at this medium article: https://pub.towardsai.net/how-to-create-a-python-package-for-fetching-weather-data-b17614627f30 And the corresponding repository: https://github.com/stavrostheocharis/weather_data_retriever submitted by /u/Nice-Tomorrow2926 [link] [comments]  ( 48 min )
    Punk Rock Funeral
    Punk Rock Funeral -- made with AIVA and Deforum Stable Diffusion. All AI, all the time! submitted by /u/Enuminous [link] [comments]  ( 46 min )
    Generative AI: From Data Generation to Creative Intelligence
    A common idea that our creativity is what makes us uniquely human has shaped society but strides of progress made in the domain of Generative Artificial Intelligence question this very notion. Generative AI is an emerging field that involves the creation of original content or data using machine learning algorithms. https://medium.com/@agrawal.sannidhya26/generative-ai-from-data-generation-to-creative-intelligence-50ed7bc13768 Feel free to give it a quick glance and help me grow and learn, click on the clap icon a few times if you appreciate the effort. submitted by /u/sannidhya26 [link] [comments]  ( 54 min )
    New AI features alert!
    New AI features alert! https://bardeen.ai/ai You no longer need to know how to build complicated automations or spend hours creating them. Bardeen will generate a custom automation for you when it detects manual tasks. Or you can type things like "transfer all my Google Sheet data to Notion", or "email all meeting participants a meeting summary", and Bardeen will generate automations for you. You can review, edit and activate it within a few clicks. submitted by /u/Intelligent_Shop_012 [link] [comments]  ( 46 min )
    Join us tomorrow at 6pm EST for a presentation covering the recent history of NLP leading up to and including ChatGPT, followed by a discussion session! Hosted on the Learn AI Together Discord (free)
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 46 min )
    Microsoft In Talks To Invest An Additional $10 Billion Into OpenAI
    submitted by /u/liquidocelotYT [link] [comments]  ( 45 min )
    ChatGPT and VR - Changing the Way we Learn Soft Skills
    submitted by /u/Iza2022 [link] [comments]  ( 49 min )
    My free 100 page non-technical book about the consequences of AI in society, employment, etc...feedback and collaborations welcome
    submitted by /u/ronin_khan [link] [comments]  ( 54 min )
    I wrote about 100+ Tools in my Newsletter: Here is a full List of all Tools
    submitted by /u/Ava-AI [link] [comments]  ( 51 min )
    The First AI Generated Beats
    submitted by /u/BoysenberryCandid181 [link] [comments]  ( 51 min )
    Do you think AI can really replace a person?
    submitted by /u/taniazhydkova [link] [comments]  ( 48 min )
    AI Being Used to Pinpoint the Most Beneficial Therapeutic Molecules in Psychedelics
    submitted by /u/secret-millionaire [link] [comments]  ( 45 min )
    So, I asked for a song about AI and I recorded it
    submitted by /u/Sladix [link] [comments]  ( 50 min )
    from a human motion sequence, SUMMON synthesizes physically plausible and semantically reasonable objects
    submitted by /u/SpatialComputing [link] [comments]  ( 47 min )
    What is ChatGPT Professional?
    submitted by /u/BackgroundResult [link] [comments]  ( 46 min )
    Creating a short film using AI ! - Looking for a team that wants to help me finish it :)
    submitted by /u/sebaschapela [link] [comments]  ( 54 min )
  • Open

    SJC A Text-to-3D Algorithm using SD
    submitted by /u/oridnary_artist [link] [comments]  ( 51 min )
    How to classify audio using deep learning and Tensorflow hub?
    https://preview.redd.it/6358imw34oba1.png?width=1280&format=png&auto=webp&s=fc68b3fb7e3768517cef1260a9786f4e062f5ed3 Tensorflow Hub has cool pre-trained models. One of the is audio and sound classification. Imagine you have a sound , and would like to detect if it a sound of a cat , or a sound of water , or maybe to classify music ….. So , this model is a cool way of classify your own audio files. Before we continue , I actually recommend this book for deep learning based on Tensorflow and Keras : https://amzn.to/3STWZ2N So, in this tutorial we will learn how to use this tensor hub model on your own audio files . The link for the video tutorial is here : https://youtu.be/_iX0VRp7UEA I also shared the Python instructions to my Github repo in the video description. Enjoy Eran #Python #Cnn #TensorFlow #deeplearning #tensorflowhub submitted by /u/Feitgemel [link] [comments]  ( 52 min )
    Deep Learning Pioneer Geoffrey Hinton Publishes New Deep Learning Algorithm
    submitted by /u/nickb [link] [comments]  ( 57 min )
    Looking for someone with good NN/ deep learning experience for a paid project
    Hello all, I'm looking for someone (1 man, team, doesn't matter) that can make a real estate related project. The project itself: a NN that you can give a document regarding some house/ apartment and based on the document the NN should give out an estimated price/ price range. So, you get a document with pics (from which the NN should determine if and how well its furnished, and its current state: brand new, used, old and broken, etc.), livable surface (how many square meters/ m2 it has, how many m2 each room has), address, if it's furnished or not, etc. and the NN should somehow check all other similar housings in the area/ neighbourhood/ city (online probably, but another NN for data extraction could also be made) and then give an adequate price. I have a friend that wants this implemented and will start looking for funding in 2 days. He asked me to give an estimated deadline and price range so that he knows what he'll be presenting. Any thoughts? Any takers? Edit: I forgot to mention. My friend knows some pretty high people in businesses that provide services to 100s or even 1000s of customers per month, so we won't be talking about breadcrumbs. submitted by /u/CuriousCesarr [link] [comments]  ( 56 min )
  • Open

    2022-23 Takeda Fellows: Leveraging AI to positively impact human health
    New fellows are working on health records, robot control, pandemic preparedness, brain injuries, and more.  ( 9 min )
    Engineering in harmony
    AeroAstro major and accomplished tuba player Frederick Ajisafe relishes the community he has found in the MIT Wind Ensemble.  ( 9 min )
  • Open

    "An Analysis of Quantile Temporal-Difference Learning", Rowland et al 2023 {DM}
    submitted by /u/gwern [link] [comments]  ( 52 min )
    Lux AI and Halite like challenges to run locally at an event?
    Hi guys! I don't know where to ask this, but i guess someone here could help me out with that. Are there any challenges like Lux AI (https://www.lux-ai.org/) and Halite (https://www.kaggle.com/c/halite) that I can run locally and make a challenge for the participants of a small event? I wanted something simple and that can be done by people of all skills (but all have a background in programming), and that can be written in a short time (about 2 hours). It also doesn't have to be an AI challenge, but I think these ones look fun do to. Thanks for hte help!! submitted by /u/HalTeaS [link] [comments]  ( 51 min )
    New Continual Learning Subreddit
    submitted by /u/Independent-Law1791 [link] [comments]  ( 52 min )
    Test environments for non image based problems?
    Procgen is a fantastic resource for testing the agent on a novel environment. Does the same resource exist for non-image based environment such as CartPole, etc? submitted by /u/Academic-Rent7800 [link] [comments]  ( 51 min )
    Has anyone here applied Reinforcement Learning with Human Feedback on a project?
    There's a lot of hype around RLHF due to its use towards ChatGPT. But I can't find many other cases where it's truly been used in the wild by people trying to tune open-source models, or their own proprietary ones. Does anyone have examples of RLHF where they've seen it applied? Or examples of doing it themselves? Thank you! submitted by /u/fourcornerclub [link] [comments]  ( 53 min )
    If statements in the reset function of an openAI gym environment?
    In my custom openAI gym environment, a simulator is launched and data collected as the state. I want an episode to end if there is either a vehicle collision or a successful final state reached. In the case of the collision I want the episode to end and the simulator to be closed and re-opened. Otherwise, I just want to introduce a new controlled vehicle, independent of the previous one. Will using an if statement to implement this in my reset function cause any issues? submitted by /u/centripetalstranger [link] [comments]  ( 55 min )
    NaNs after first fully connected layer
    I'm working on a MARL project. The observation is a (31,1) vector that I first process with a few fully connected layers. Then, the output is sent into a recurrent policy. Now, for some reason, after a few million steps of training, the observation gets sent into the first FC and becomes a matrix of NaNs. I checked and there are no NaNs in the observation. Example of the observation from the last crash: ​ ``` tensor([[ 2.8740e-02, 2.2078e-02, 1.9542e-02, ..., -3.3949e-01, 6.2327e-02, -2.8951e-04], [ 4.0109e-02, 2.2649e-02, 2.0599e-02, ..., -3.3947e-01, 5.5702e-02, -5.4328e-05], [ 5.1799e-02, 2.3269e-02, 2.1813e-02, ..., -3.4162e-01, 5.3501e-02, -8.1255e-04], ..., [ 1.7621e-01, 2.1108e-03, 1.4367e-02, ..., -3.4072e-01, 4.2021e-02, -1.3159e-02], [ 1.7600e-01, -2.2701e-05, 1.2215e-02, ..., -3.4045e-01, 4.2869e-02, -1.3915e-02], [ 1.7618e-01, 4.4542e-04, 1.2899e-02, ..., -3.4266e-01, 4.4017e-02, -1.8093e-02]], device='cuda:0') ``` ​ I've tried a few things that did not work: using LeakyReLu instead of ReLu and removing Layer Normalization. ​ Do you have any tips? TL;DR Any ideas on why a fully connected layer that processes the observation outputs NaNs after a few million steps? submitted by /u/No_Possibility_7588 [link] [comments]  ( 52 min )
    "Learning to Play Minecraft with Video PreTraining (VPT)" {OA}
    submitted by /u/gwern [link] [comments]  ( 54 min )
    Google Intrinsic robotics company lays off 20% (40) employees {The Information} (paywall)
    submitted by /u/gwern [link] [comments]  ( 56 min )
  • Open

    Multilingual customer support translation made easy on Salesforce Service Cloud using Amazon Translate
    This post was co-authored with Mark Lott, Distinguished Technical Architect, Salesforce, Inc. Enterprises that operate globally are experiencing challenges sourcing customer support professionals with multi-lingual experience. This process can be cost-prohibitive and difficult to scale, leading many enterprises to only support English for chats. Using human interpreters for translation support is expensive, and infeasible since […]  ( 10 min )
    Redacting PII data at The Very Group with Amazon Comprehend
    This is guest post by Andy Whittle, Principal Platform Engineer – Application & Reliability Frameworks at The Very Group. At The Very Group, which operates digital retailer Very, security is a top priority in handling data for millions of customers. Part of how The Very Group secures and tracks business operations is through activity logging […]  ( 7 min )
  • Open

    Advancing human-centered AI: Updates on responsible AI research
    Artificial intelligence, like all tools we build, is an expression of human creativity. As with all creative expression, AI manifests the perspectives and values of its creators. A stance that encourages reflexivity among AI practitioners is a step toward ensuring that AI systems are human-centered, developed, and deployed with the interests and well-being of individuals and society front and center. This is the focus of research scientists and engineers affiliated with Aether, the advisory council for Microsoft leadership on AI ethics and effects. Central to Aether’s work is the question of who we’re creating AI for—and whether we’re creating AI to solve real problems with responsible solutions. With AI capabilities accelerating, our researchers work to understand the sociotechnical implications and find ways to help on-the-ground practitioners envision and realize these capabilities in line with Microsoft AI principles. The post Advancing human-centered AI: Updates on responsible AI research appeared first on Microsoft Research.  ( 15 min )
  • Open

    Primes with two non-zero bits
    Suppose a number n written in binary has two 1s and all the rest of its bits are zeros. If n is prime, then the 1s must be the first and last bits of n. The first bit is 1 because the first bit of every positive integer is 1. The last bit is 1 […] Primes with two non-zero bits first appeared on John D. Cook.  ( 6 min )
    Certified sonnet primes
    Last week I wrote about primailty certificates. These certificates offer a way to verify that a number is prime using less computation than was used to discover than the number was prime. This post gives a couple more examples of primality certificates using sonnet primes. As described here, These are primes of the form ababcdcdefefgg, […] Certified sonnet primes first appeared on John D. Cook.  ( 4 min )
  • Open

    NVIDIA, Evozyne Create Generative AI Model for Proteins
    Using a pretrained AI model from NVIDIA, startup Evozyne created two proteins with significant potential in healthcare and clean energy. A joint paper released today describes the process and the biological building blocks it produced. One aims to cure a congenital disease, another is designed to consume carbon dioxide to reduce global warming. Initial results Read article >  ( 5 min )
    GFN Thursday Adds New Titles From THQ Nordic to GeForce NOW
    GFN Thursday kicks each weekend off with new games and updates straight from the cloud. This week adds more games from publisher THQ Nordic to the GeForce NOW library, as part seven total additions. Members can gear up to play these new titles the ultimate way with the upcoming release of the new Ultimate membership, Read article >  ( 6 min )
    NVIDIA Helps Retail Industry Tackle Its $100 Billion Shrink Problem
    The global retail industry has a $100 billion problem. “Shrinkage” — the loss of goods due to theft, damage and misplacement — significantly crimps retailers’ profits. An estimated 65% of shrinkage is due to theft, according to the National Retail Federation’s 2022 Retail Security Survey, conducted in partnership with the Loss Prevention Research Council. And Read article >  ( 6 min )
  • Open

    Best Arm Identification in Stochastic Bandits: Beyond $\beta-$optimality. (arXiv:2301.03785v1 [stat.ML])
    This paper focuses on best arm identification (BAI) in stochastic multi-armed bandits (MABs) in the fixed-confidence, parametric setting. In such pure exploration problems, the accuracy of the sampling strategy critically hinges on the sequential allocation of the sampling resources among the arms. The existing approaches to BAI address the following question: what is an optimal sampling strategy when we spend a $\beta$ fraction of the samples on the best arm? These approaches treat $\beta$ as a tunable parameter and offer efficient algorithms that ensure optimality up to selecting $\beta$, hence $\beta-$optimality. However, the BAI decisions and performance can be highly sensitive to the choice of $\beta$. This paper provides a BAI algorithm that is agnostic to $\beta$, dispensing with the need for tuning $\beta$, and specifies an optimal allocation strategy, including the optimal value of $\beta$. Furthermore, the existing relevant literature focuses on the family of exponential distributions. This paper considers a more general setting of any arbitrary family of distributions parameterized by their mean values (under mild regularity conditions).  ( 2 min )
    Pessimistic Model-based Offline Reinforcement Learning under Partial Coverage. (arXiv:2107.06226v4 [cs.LG] UPDATED)
    We study model-based offline Reinforcement Learning with general function approximation without a full coverage assumption on the offline data distribution. We present an algorithm named Constrained Pessimistic Policy Optimization (CPPO)which leverages a general function class and uses a constraint over the model class to encode pessimism. Under the assumption that the ground truth model belongs to our function class (i.e., realizability in the function class), CPPO has a PAC guarantee with offline data only providing partial coverage, i.e., it can learn a policy that competes against any policy that is covered by the offline data. We then demonstrate that this algorithmic framework can be applied to many specialized Markov Decision Processes where additional structural assumptions can further refine the concept of partial coverage. Two notable examples are: (1) low-rank MDP with representation learning where the partial coverage condition is defined using a relative condition number measured by the unknown ground truth feature representation; (2) factored MDP where the partial coverage condition is defined using density ratio based concentrability coefficients associated with individual factors.  ( 2 min )
    Sharing pattern submodels for prediction with missing values. (arXiv:2206.11161v2 [cs.LG] UPDATED)
    Missing values are unavoidable in many applications of machine learning and present challenges both during training and at test time. When variables are missing in recurring patterns, fitting separate pattern submodels have been proposed as a solution. However, fitting models independently does not make efficient use of all available data. Conversely, fitting a single shared model to the full data set relies on imputation which often leads to biased results when missingness depends on unobserved factors. We propose an alternative approach, called sharing pattern submodels, which i) makes predictions that are robust to missing values at test time, ii) maintains or improves the predictive power of pattern submodels, and iii) has a short description, enabling improved interpretability. Parameter sharing is enforced through sparsity-inducing regularization which we prove leads to consistent estimation. Finally, we give conditions for when a sharing model is optimal, even when both missingness and the target outcome depend on unobserved variables. Classification and regression experiments on synthetic and real-world data sets demonstrate that our models achieve a favorable tradeoff between pattern specialization and information sharing.  ( 2 min )
    Optimal randomized multilevel Monte Carlo for repeatedly nested expectations. (arXiv:2301.04095v1 [stat.CO])
    The estimation of repeatedly nested expectations is a challenging problem that arises in many real-world systems. However, existing methods generally suffer from high computational costs when the number of nestings becomes large. Fix any non-negative integer $D$ for the total number of nestings. Standard Monte Carlo methods typically cost at least $\mathcal{O}(\varepsilon^{-(2+D)})$ and sometimes $\mathcal{O}(\varepsilon^{-2(1+D)})$ to obtain an estimator up to $\varepsilon$-error. More advanced methods, such as multilevel Monte Carlo, currently only exist for $D = 1$. In this paper, we propose a novel Monte Carlo estimator called $\mathsf{READ}$, which stands for "Recursive Estimator for Arbitrary Depth.'' Our estimator has an optimal computational cost of $\mathcal{O}(\varepsilon^{-2})$ for every fixed $D$ under suitable assumptions, and a nearly optimal computational cost of $\mathcal{O}(\varepsilon^{-2(1 + \delta)})$ for any $0 < \delta < \frac12$ under much more general assumptions. Our estimator is also unbiased, which makes it easy to parallelize. The key ingredients in our construction are an observation of the problem's recursive structure and the recursive use of the randomized multilevel Monte Carlo method.  ( 2 min )
    Mastering Diverse Domains through World Models. (arXiv:2301.04104v1 [cs.AI])
    General intelligence requires solving tasks across many domains. Current reinforcement learning algorithms carry this potential but are held back by the resources and knowledge required to tune them for new tasks. We present DreamerV3, a general and scalable algorithm based on world models that outperforms previous approaches across a wide range of domains with fixed hyperparameters. These domains include continuous and discrete actions, visual and low-dimensional inputs, 2D and 3D worlds, different data budgets, reward frequencies, and reward scales. We observe favorable scaling properties of DreamerV3, with larger models directly translating to higher data-efficiency and final performance. Applied out of the box, DreamerV3 is the first algorithm to collect diamonds in Minecraft from scratch without human data or curricula, a long-standing challenge in artificial intelligence. Our general algorithm makes reinforcement learning broadly applicable and allows scaling to hard decision making problems.  ( 2 min )
    Attribution-based Explanations that Provide Recourse Cannot be Robust. (arXiv:2205.15834v2 [stat.ML] UPDATED)
    Different users of machine learning methods require different explanations, depending on their goals. To make machine learning accountable to society, one important goal is to get actionable options for recourse, which allow an affected user to change the decision $f(x)$ of a machine learning system by making limited changes to its input $x$. We formalize this by providing a general definition of recourse sensitivity, which needs to be instantiated with a utility function that describes which changes to the decisions are relevant to the user. This definition applies to local attribution methods, which attribute an importance weight to each input feature. It is often argued that such local attributions should be robust, in the sense that a small change in the input $x$ that is being explained, should not cause a large change in the feature weights. However, we prove formally that it is in general impossible for any single attribution method to be both recourse sensitive and robust at the same time. It follows that there must always exist counterexamples to at least one of these properties. We provide such counterexamples for several popular attribution methods, including LIME, SHAP, Integrated Gradients and SmoothGrad. Our results also cover counterfactual explanations, which may be viewed as attributions that describe a perturbation of $x$. We further discuss possible ways to work around our impossibility result, for instance by allowing the output to consist of sets with multiple attributions, and we provide sufficient conditions for specific classes of continuous functions to be recourse sensitive. Finally, we strengthen our impossibility result for the restricted case where users are only able to change a single attribute of $x$, by providing an exact characterization of the functions $f$ to which impossibility applies.  ( 2 min )
    Manifold Restricted Interventional Shapley Values. (arXiv:2301.04041v1 [stat.ML])
    Shapley values are model-agnostic methods for explaining model predictions. Many commonly used methods of computing Shapley values, known as \emph{off-manifold methods}, rely on model evaluations on out-of-distribution input samples. Consequently, explanations obtained are sensitive to model behaviour outside the data distribution, which may be irrelevant for all practical purposes. While \emph{on-manifold methods} have been proposed which do not suffer from this problem, we show that such methods are overly dependent on the input data distribution, and therefore result in unintuitive and misleading explanations. To circumvent these problems, we propose \emph{ManifoldShap}, which respects the model's domain of validity by restricting model evaluations to the data manifold. We show, theoretically and empirically, that ManifoldShap is robust to off-manifold perturbations of the model and leads to more accurate and intuitive explanations than existing state-of-the-art Shapley methods.  ( 2 min )
    Sampling random graph homomorphisms and applications to network data analysis. (arXiv:1910.09483v3 [math.PR] UPDATED)
    A graph homomorphism is a map between two graphs that preserves adjacency relations. We consider the problem of sampling a random graph homomorphism from a graph into a large network. We propose two complementary MCMC algorithms for sampling random graph homomorphisms and establish bounds on their mixing times and the concentration of their time averages. Based on our sampling algorithms, we propose a novel framework for network data analysis that circumvents some of the drawbacks in methods based on independent and neighborhood sampling. Various time averages of the MCMC trajectory give us various computable observables, including well-known ones such as homomorphism density and average clustering coefficient and their generalizations. Furthermore, we show that these network observables are stable with respect to a suitably renormalized cut distance between networks. We provide various examples and simulations demonstrating our framework through synthetic networks. We also \commHL{demonstrate the performance of} our framework on the tasks of network clustering and subgraph classification on the Facebook100 dataset and on Word Adjacency Networks of a set of classic novels.  ( 2 min )
    Calibrated simplex-mapping classification. (arXiv:2103.02926v2 [stat.ML] UPDATED)
    We propose a novel methodology for general multi-class classification in arbitrary feature spaces, which results in a potentially well-calibrated classifier. Calibrated classifiers are important in many applications because, in addition to the prediction of mere class labels, they also yield a confidence level for each of their predictions. In essence, the training of our classifier proceeds in two steps. In a first step, the training data is represented in a latent space whose geometry is induced by a regular $(n-1)$-dimensional simplex, $n$ being the number of classes. We design this representation in such a way that it well reflects the feature space distances of the datapoints to their own- and foreign-class neighbors. In a second step, the latent space representation of the training data is extended to the whole feature space by fitting a regression model to the transformed data. With this latent-space representation, our calibrated classifier is readily defined. We rigorously establish its core theoretical properties and benchmark its prediction and calibration properties by means of various synthetic and real-world data sets from different application domains.
    Adversarial Policies Beat Superhuman Go AIs. (arXiv:2211.00241v2 [cs.LG] UPDATED)
    We attack the state-of-the-art Go-playing AI system, KataGo, by training adversarial policies that play against frozen KataGo victims. Our attack achieves a >99% win rate when KataGo uses no tree-search, and a >77% win rate when KataGo uses enough search to be superhuman. Notably, our adversaries do not win by learning to play Go better than KataGo -- in fact, our adversaries are easily beaten by human amateurs. Instead, our adversaries win by tricking KataGo into making serious blunders. Our results demonstrate that even superhuman AI systems may harbor surprising failure modes. Example games are available at https://goattack.far.ai/.
    A Unified Theory of Diversity in Ensemble Learning. (arXiv:2301.03962v1 [cs.LG])
    We present a theory of ensemble diversity, explaining the nature and effect of diversity for a wide range of supervised learning scenarios. This challenge, of understanding ensemble diversity, has been referred to as the holy grail of ensemble learning, an open question for over 30 years. Our framework reveals that diversity is in fact a hidden dimension in the bias-variance decomposition of an ensemble. In particular, we prove a family of exact bias-variance-diversity decompositions, for both classification and regression losses, e.g., squared, and cross-entropy. The framework provides a methodology to automatically identify the combiner rule enabling such a decomposition, specific to the loss. The formulation of diversity is therefore dependent on just two design choices: the loss, and the combiner. For certain choices (e.g., 0-1 loss with majority voting) the effect of diversity is necessarily dependent on the target label. Experiments illustrate how we can use our framework to understand the diversity-encouraging mechanisms of popular ensemble methods: Bagging, Boosting, and Random Forests.
    Semiparametric Regression for Spatial Data via Deep Learning. (arXiv:2301.03747v1 [stat.ML])
    In this work, we propose a deep learning-based method to perform semiparametric regression analysis for spatially dependent data. To be specific, we use a sparsely connected deep neural network with rectified linear unit (ReLU) activation function to estimate the unknown regression function that describes the relationship between response and covariates in the presence of spatial dependence. Under some mild conditions, the estimator is proven to be consistent, and the rate of convergence is determined by three factors: (1) the architecture of neural network class, (2) the smoothness and (intrinsic) dimension of true mean function, and (3) the magnitude of spatial dependence. Our method can handle well large data set owing to the stochastic gradient descent optimization algorithm. Simulation studies on synthetic data are conducted to assess the finite sample performance, the results of which indicate that the proposed method is capable of picking up the intricate relationship between response and covariates. Finally, a real data analysis is provided to demonstrate the validity and effectiveness of the proposed method.
    Markovian Sliced Wasserstein Distances: Beyond Independent Projections. (arXiv:2301.03749v1 [stat.ML])
    Sliced Wasserstein (SW) distance suffers from redundant projections due to independent uniform random projecting directions. To partially overcome the issue, max K sliced Wasserstein (Max-K-SW) distance ($K\geq 1$), seeks the best discriminative orthogonal projecting directions. Despite being able to reduce the number of projections, the metricity of Max-K-SW cannot be guaranteed in practice due to the non-optimality of the optimization. Moreover, the orthogonality constraint is also computationally expensive and might not be effective. To address the problem, we introduce a new family of SW distances, named Markovian sliced Wasserstein (MSW) distance, which imposes a first-order Markov structure on projecting directions. We discuss various members of MSW by specifying the Markov structure including the prior distribution, the transition distribution, and the burning and thinning technique. Moreover, we investigate the theoretical properties of MSW including topological properties (metricity, weak convergence, and connection to other distances), statistical properties (sample complexity, and Monte Carlo estimation error), and computational properties (computational complexity and memory complexity). Finally, we compare MSW distances with previous SW variants in various applications such as gradient flows, color transfer, and deep generative modeling to demonstrate the favorable performance of MSW.
    HierarchicalForecast: A Reference Framework for Hierarchical Forecasting in Python. (arXiv:2207.03517v4 [stat.ML] UPDATED)
    Large collections of time series data are commonly organized into structures with different levels of aggregation; examples include product and geographical groupings. It is often important to ensure that the forecasts are coherent so that the predicted values at disaggregate levels add up to the aggregate forecast. The growing interest of the Machine Learning community in hierarchical forecasting systems indicates that we are in a propitious moment to ensure that scientific endeavors are grounded on sound baselines. For this reason, we put forward the HierarchicalForecast library, which contains preprocessed publicly available datasets, evaluation metrics, and a compiled set of statistical baseline models. Our Python-based reference framework aims to bridge the gap between statistical and econometric modeling, and Machine Learning forecasting research. Code and documentation are available in https://github.com/Nixtla/hierarchicalforecast.
    Bayesian Additive Main Effects and Multiplicative Interaction Models using Tensor Regression for Multi-environmental Trials. (arXiv:2301.03655v1 [stat.ML])
    We propose a Bayesian tensor regression model to accommodate the effect of multiple factors on phenotype prediction. We adopt a set of prior distributions that resolve identifiability issues that may arise between the parameters in the model. Simulation experiments show that our method out-performs previous related models and machine learning algorithms under different sample sizes and degrees of complexity. We further explore the applicability of our model by analysing real-world data related to wheat production across Ireland from 2010 to 2019. Our model performs competitively and overcomes key limitations found in other analogous approaches. Finally, we adapt a set of visualisations for the posterior distribution of the tensor effects that facilitate the identification of optimal interactions between the tensor variables whilst accounting for the uncertainty in the posterior distribution.
    Community Detection with Known, Unknown, or Partially Known Auxiliary Latent Variables. (arXiv:2301.04088v1 [cs.SI])
    Empirical observations suggest that in practice, community membership does not completely explain the dependency between the edges of an observation graph. The residual dependence of the graph edges are modeled in this paper, to first order, by auxiliary node latent variables that affect the statistics of the graph edges but carry no information about the communities of interest. We then study community detection in graphs obeying the stochastic block model and censored block model with auxiliary latent variables. We analyze the conditions for exact recovery when these auxiliary latent variables are unknown, representing unknown nuisance parameters or model mismatch. We also analyze exact recovery when these secondary latent variables have been either fully or partially revealed. Finally, we propose a semidefinite programming algorithm for recovering the desired labels when the secondary labels are either known or unknown. We show that exact recovery is possible by semidefinite programming down to the respective maximum likelihood exact recovery threshold.  ( 2 min )
  • Open

    Goal Misgeneralization in Deep Reinforcement Learning. (arXiv:2105.14111v7 [cs.LG] UPDATED)
    We study goal misgeneralization, a type of out-of-distribution generalization failure in reinforcement learning (RL). Goal misgeneralization failures occur when an RL agent retains its capabilities out-of-distribution yet pursues the wrong goal. For instance, an agent might continue to competently avoid obstacles, but navigate to the wrong place. In contrast, previous works have typically focused on capability generalization failures, where an agent fails to do anything sensible at test time. We formalize this distinction between capability and goal generalization, provide the first empirical demonstrations of goal misgeneralization, and present a partial characterization of its causes.  ( 2 min )
    Distributed Sparse Linear Regression under Communication Constraints. (arXiv:2301.04022v1 [cs.LG])
    In multiple domains, statistical tasks are performed in distributed settings, with data split among several end machines that are connected to a fusion center. In various applications, the end machines have limited bandwidth and power, and thus a tight communication budget. In this work we focus on distributed learning of a sparse linear regression model, under severe communication constraints. We propose several two round distributed schemes, whose communication per machine is sublinear in the data dimension. In our schemes, individual machines compute debiased lasso estimators, but send to the fusion center only very few values. On the theoretical front, we analyze one of these schemes and prove that with high probability it achieves exact support recovery at low signal to noise ratios, where individual machines fail to recover the support. We show in simulations that our scheme works as well as, and in some cases better, than more communication intensive approaches.  ( 2 min )
    The troublesome kernel -- On hallucinations, no free lunches and the accuracy-stability trade-off in inverse problems. (arXiv:2001.01258v2 [cs.LG] UPDATED)
    Methods inspired by Artificial Intelligence (AI) are starting to fundamentally change computational science and engineering through breakthrough performances on challenging problems. However, reliability and trustworthiness of such techniques is becoming a major concern. In inverse problems in imaging, the focus of this paper, there is increasing empirical evidence that methods may suffer from hallucinations, i.e., false, but realistic-looking artifacts; instability, i.e., sensitivity to perturbations in the data; and unpredictable generalization, i.e., excellent performance on some images, but significant deterioration on others. This paper presents a theoretical foundation for these phenomena. We give a mathematical framework describing how and when such effects arise in arbitrary reconstruction methods, not just AI-inspired techniques. Several of our results take the form of 'no free lunch' theorems. Specifically, we show that (i) methods that overperform on a single image can wrongly transfer details from one image to another, creating a hallucination, (ii) methods that overperform on two or more images can hallucinate or be unstable, (iii) optimizing the accuracy-stability trade-off is generally difficult, (iv) hallucinations and instabilities, if they occur, are not rare events, and may be encouraged by standard training, (v) it may be impossible to construct optimal reconstruction maps for certain problems, (vi) standard methods to improve reliability (e.g., regularization or adversarial training) may themselves lead to unstable problems. Our results trace these effects to the kernel of the forwards operator. They assert that such effects can be avoided only if information about the kernel is encoded into the reconstruction procedure. Based on this, this work aims to spur research into new ways to develop robust and reliable AI-inspired methods for inverse problems in imaging.  ( 3 min )
    VeriX: Towards Verified Explainability of Deep Neural Networks. (arXiv:2212.01051v3 [cs.LG] UPDATED)
    We present VeriX, a system for producing optimal robust explanations (La Malfa et al. 2021) for machine learning models. We build robust explanations iteratively using constraint solving techniques and a heuristic based on feature-level sensitivity ranking. We evaluate our approach on image recognition benchmarks and a real-world scenario of autonomous aircraft taxiing.  ( 2 min )
    Adversarial Policies Beat Superhuman Go AIs. (arXiv:2211.00241v2 [cs.LG] UPDATED)
    We attack the state-of-the-art Go-playing AI system, KataGo, by training adversarial policies that play against frozen KataGo victims. Our attack achieves a >99% win rate when KataGo uses no tree-search, and a >77% win rate when KataGo uses enough search to be superhuman. Notably, our adversaries do not win by learning to play Go better than KataGo -- in fact, our adversaries are easily beaten by human amateurs. Instead, our adversaries win by tricking KataGo into making serious blunders. Our results demonstrate that even superhuman AI systems may harbor surprising failure modes. Example games are available at https://goattack.far.ai/.  ( 2 min )
    Structural risk minimization for quantum linear classifiers. (arXiv:2105.05566v3 [quant-ph] UPDATED)
    Quantum machine learning (QML) models based on parameterized quantum circuits are often highlighted as candidates for quantum computing's near-term ``killer application''. However, the understanding of the empirical and generalization performance of these models is still in its infancy. In this paper we study how to balance between training accuracy and generalization performance (also called structural risk minimization) for two prominent QML models introduced by Havl\'{i}\v{c}ek et al. (Nature, 2019), and Schuld and Killoran (PRL, 2019). Firstly, using relationships to well understood classical models, we prove that two model parameters -- i.e., the dimension of the sum of the images and the Frobenius norm of the observables used by the model -- closely control the models' complexity and therefore its generalization performance. Secondly, using ideas inspired by process tomography, we prove that these model parameters also closely control the models' ability to capture correlations in sets of training examples. In summary, our results give rise to new options for structural risk minimization for QML models.  ( 2 min )
    Differentiable, learnable, regionalized process-based models with physical outputs can approach state-of-the-art hydrologic prediction accuracy. (arXiv:2203.14827v2 [cs.LG] UPDATED)
    Predictions of hydrologic variables across the entire water cycle have significant value for water resource management as well as downstream applications such as ecosystem and water quality modeling. Recently, purely data-driven deep learning models like long short-term memory (LSTM) showed seemingly-insurmountable performance in modeling rainfall-runoff and other geoscientific variables, yet they cannot predict untrained physical variables and remain challenging to interpret. Here we show that differentiable, learnable, process-based models (called {\delta} models here) can approach the performance level of LSTM for the intensively-observed variable (streamflow) with regionalized parameterization. We use a simple hydrologic model HBV as the backbone and use embedded neural networks, which can only be trained in a differentiable programming framework, to parameterize, enhance, or replace the process-based model modules. Without using an ensemble or post-processor, {\delta} models can obtain a median Nash Sutcliffe efficiency of 0.732 for 671 basins across the USA for the Daymet forcing dataset, compared to 0.748 from a state-of-the-art LSTM model with the same setup. For another forcing dataset, the difference is even smaller: 0.715 vs. 0.722. Meanwhile, the resulting learnable process-based models can output a full set of untrained variables, e.g., soil and groundwater storage, snowpack, evapotranspiration, and baseflow, and later be constrained by their observations. Both simulated evapotranspiration and fraction of discharge from baseflow agreed decently with alternative estimates. The general framework can work with models with various process complexity and opens up the path for learning physics from big data.  ( 2 min )
    Partial order: Finding Consensus among Uncertain Feature Attributions. (arXiv:2110.13369v2 [cs.LG] UPDATED)
    Post-hoc feature attribution methods are progressively being employed to explain decisions of complex machine learning models. Yet, it is possible for practitioners to obtain a diversity of models that provide very different explanations to the same prediction, making it hard to derive insight from them. In this work, instead of aiming at reducing the under-specification of model explanations, we fully embrace it and extract logical statements about feature attributions that are consistent across multiple models with good performance. We show that a partial order of feature importance arises from this methodology enabling more nuanced explanations by allowing pairs of features to be incomparable when there is no consensus on their relative importance. We prove that every relation among features present in these partial order also holds in the rankings provided by existing approaches. Finally, we present use cases on three datasets where partial orders allow one to extract knowledge from models despite their under-specification.  ( 2 min )
    Combinatorial Pure Exploration of Causal Bandits. (arXiv:2206.07883v2 [cs.LG] UPDATED)
    The combinatorial pure exploration of causal bandits is the following online learning task: given a causal graph with unknown causal inference distributions, in each round we choose a subset of variables to intervene or do no intervention, and observe the random outcomes of all random variables, with the goal that using as few rounds as possible, we can output an intervention that gives the best (or almost best) expected outcome on the reward variable $Y$ with probability at least $1-\delta$, where $\delta$ is a given confidence level. We provide the first gap-dependent and fully adaptive pure exploration algorithms on two types of causal models -- the binary generalized linear model (BGLM) and general graphs. For BGLM, our algorithm is the first to be designed specifically for this setting and achieves polynomial sample complexity, while all existing algorithms for general graphs have either sample complexity exponential to the graph size or some unreasonable assumptions. For general graphs, our algorithm provides a significant improvement on sample complexity, and it nearly matches the lower bound we prove. Our algorithms achieve such improvement by a novel integration of prior causal bandit algorithms and prior adaptive pure exploration algorithms, the former of which utilize the rich observational feedback in causal bandits but are not adaptive to reward gaps, while the latter of which have the issue in reverse.  ( 2 min )
    Differentiable modeling to unify machine learning and physical models and advance Geosciences. (arXiv:2301.04027v1 [cs.LG])
    Process-Based Modeling (PBM) and Machine Learning (ML) are often perceived as distinct paradigms in the geosciences. Here we present differentiable geoscientific modeling as a powerful pathway toward dissolving the perceived barrier between them and ushering in a paradigm shift. For decades, PBM offered benefits in interpretability and physical consistency but struggled to efficiently leverage large datasets. ML methods, especially deep networks, presented strong predictive skills yet lacked the ability to answer specific scientific questions. While various methods have been proposed for ML-physics integration, an important underlying theme -- differentiable modeling -- is not sufficiently recognized. Here we outline the concepts, applicability, and significance of differentiable geoscientific modeling (DG). "Differentiable" refers to accurately and efficiently calculating gradients with respect to model variables, critically enabling the learning of high-dimensional unknown relationships. DG refers to a range of methods connecting varying amounts of prior knowledge to neural networks and training them together, capturing a different scope than physics-guided machine learning and emphasizing first principles. Preliminary evidence suggests DG offers better interpretability and causality than ML, improved generalizability and extrapolation capability, and strong potential for knowledge discovery, while approaching the performance of purely data-driven ML. DG models require less training data while scaling favorably in performance and efficiency with increasing amounts of data. With DG, geoscientists may be better able to frame and investigate questions, test hypotheses, and discover unrecognized linkages.  ( 2 min )
    ViKiNG: Vision-Based Kilometer-Scale Navigation with Geographic Hints. (arXiv:2202.11271v3 [cs.RO] UPDATED)
    Robotic navigation has been approached as a problem of 3D reconstruction and planning, as well as an end-to-end learning problem. However, long-range navigation requires both planning and reasoning about local traversability, as well as being able to utilize general knowledge about global geography, in the form of a roadmap, GPS, or other side information providing important cues. In this work, we propose an approach that integrates learning and planning, and can utilize side information such as schematic roadmaps, satellite maps and GPS coordinates as a planning heuristic, without relying on them being accurate. Our method, ViKiNG, incorporates a local traversability model, which looks at the robot's current camera observation and a potential subgoal to infer how easily that subgoal can be reached, as well as a heuristic model, which looks at overhead maps for hints and attempts to evaluate the appropriateness of these subgoals in order to reach the goal. These models are used by a heuristic planner to identify the best waypoint in order to reach the final destination. Our method performs no explicit geometric reconstruction, utilizing only a topological representation of the environment. Despite having never seen trajectories longer than 80 meters in its training dataset, ViKiNG can leverage its image-based learned controller and goal-directed heuristic to navigate to goals up to 3 kilometers away in previously unseen environments, and exhibit complex behaviors such as probing potential paths and backtracking when they are found to be non-viable. ViKiNG is also robust to unreliable maps and GPS, since the low-level controller ultimately makes decisions based on egocentric image observations, using maps only as planning heuristics. For videos of our experiments, please check out our project page https://sites.google.com/view/viking-release.  ( 3 min )
    Pessimistic Model-based Offline Reinforcement Learning under Partial Coverage. (arXiv:2107.06226v4 [cs.LG] UPDATED)
    We study model-based offline Reinforcement Learning with general function approximation without a full coverage assumption on the offline data distribution. We present an algorithm named Constrained Pessimistic Policy Optimization (CPPO)which leverages a general function class and uses a constraint over the model class to encode pessimism. Under the assumption that the ground truth model belongs to our function class (i.e., realizability in the function class), CPPO has a PAC guarantee with offline data only providing partial coverage, i.e., it can learn a policy that competes against any policy that is covered by the offline data. We then demonstrate that this algorithmic framework can be applied to many specialized Markov Decision Processes where additional structural assumptions can further refine the concept of partial coverage. Two notable examples are: (1) low-rank MDP with representation learning where the partial coverage condition is defined using a relative condition number measured by the unknown ground truth feature representation; (2) factored MDP where the partial coverage condition is defined using density ratio based concentrability coefficients associated with individual factors.  ( 2 min )
    Mastering Diverse Domains through World Models. (arXiv:2301.04104v1 [cs.AI])
    General intelligence requires solving tasks across many domains. Current reinforcement learning algorithms carry this potential but are held back by the resources and knowledge required to tune them for new tasks. We present DreamerV3, a general and scalable algorithm based on world models that outperforms previous approaches across a wide range of domains with fixed hyperparameters. These domains include continuous and discrete actions, visual and low-dimensional inputs, 2D and 3D worlds, different data budgets, reward frequencies, and reward scales. We observe favorable scaling properties of DreamerV3, with larger models directly translating to higher data-efficiency and final performance. Applied out of the box, DreamerV3 is the first algorithm to collect diamonds in Minecraft from scratch without human data or curricula, a long-standing challenge in artificial intelligence. Our general algorithm makes reinforcement learning broadly applicable and allows scaling to hard decision making problems.  ( 2 min )
    Robust Deep Reinforcement Learning through Bootstrapped Opportunistic Curriculum. (arXiv:2206.10057v2 [cs.LG] UPDATED)
    Despite considerable advances in deep reinforcement learning, it has been shown to be highly vulnerable to adversarial perturbations to state observations. Recent efforts that have attempted to improve adversarial robustness of reinforcement learning can nevertheless tolerate only very small perturbations, and remain fragile as perturbation size increases. We propose Bootstrapped Opportunistic Adversarial Curriculum Learning (BCL), a novel flexible adversarial curriculum learning framework for robust reinforcement learning. Our framework combines two ideas: conservatively bootstrapping each curriculum phase with highest quality solutions obtained from multiple runs of the previous phase, and opportunistically skipping forward in the curriculum. In our experiments we show that the proposed BCL framework enables dramatic improvements in robustness of learned policies to adversarial perturbations. The greatest improvement is for Pong, where our framework yields robustness to perturbations of up to 25/255; in contrast, the best existing approach can only tolerate adversarial noise up to 5/255. Our code is available at: https://github.com/jlwu002/BCL.  ( 2 min )
    IronForge: An Open, Secure, Fair, Decentralized Federated Learning. (arXiv:2301.04006v1 [cs.LG])
    Federated learning (FL) provides an effective machine learning (ML) architecture to protect data privacy in a distributed manner. However, the inevitable network asynchrony, the over-dependence on a central coordinator, and the lack of an open and fair incentive mechanism collectively hinder its further development. We propose \textsc{IronForge}, a new generation of FL framework, that features a Directed Acyclic Graph (DAG)-based data structure and eliminates the need for central coordinators to achieve fully decentralized operations. \textsc{IronForge} runs in a public and open network, and launches a fair incentive mechanism by enabling state consistency in the DAG, so that the system fits in networks where training resources are unevenly distributed. In addition, dedicated defense strategies against prevalent FL attacks on incentive fairness and data privacy are presented to ensure the security of \textsc{IronForge}. Experimental results based on a newly developed testbed FLSim highlight the superiority of \textsc{IronForge} to the existing prevalent FL frameworks under various specifications in performance, fairness, and security. To the best of our knowledge, \textsc{IronForge} is the first secure and fully decentralized FL framework that can be applied in open networks with realistic network and training settings.  ( 2 min )
    Vision Transformers Are Good Mask Auto-Labelers. (arXiv:2301.03992v1 [cs.CV])
    We propose Mask Auto-Labeler (MAL), a high-quality Transformer-based mask auto-labeling framework for instance segmentation using only box annotations. MAL takes box-cropped images as inputs and conditionally generates their mask pseudo-labels.We show that Vision Transformers are good mask auto-labelers. Our method significantly reduces the gap between auto-labeling and human annotation regarding mask quality. Instance segmentation models trained using the MAL-generated masks can nearly match the performance of their fully-supervised counterparts, retaining up to 97.4\% performance of fully supervised models. The best model achieves 44.1\% mAP on COCO instance segmentation (test-dev 2017), outperforming state-of-the-art box-supervised methods by significant margins. Qualitative results indicate that masks produced by MAL are, in some cases, even better than human annotations.  ( 2 min )
    FOLD-SE: An Efficient Rule-based Machine Learning Algorithm with Scalable Explainability. (arXiv:2208.07912v2 [cs.LG] UPDATED)
    We present FOLD-SE, an efficient, explainable machine learning algorithm for classification tasks given tabular data containing numerical and categorical values. FOLD-SE generates a set of default rules-essentially a stratified normal logic program-as an (explainable) trained model. Explainability provided by FOLD-SE is scalable, meaning that regardless of the size of the dataset, the number of learned rules and learned literals stay quite small while good accuracy in classification is maintained. A model with smaller number of rules and literals is easier to understand for human beings. FOLD-SE is competitive with state-of-the-art machine learning algorithms such as XGBoost and Multi-Layer Perceptrons (MLP) wrt accuracy of prediction. However, unlike XGBoost and MLP, the FOLD-SE algorithm is explainable. The FOLD-SE algorithm builds upon our earlier work on developing the explainable FOLD-R++ machine learning algorithm for binary classification and inherits all of its positive features. Thus, pre-processing of the dataset, using techniques such as one-hot encoding, is not needed. Like FOLD-R++, FOLD-SE uses prefix sum to speed up computations resulting in FOLD-SE being an order of magnitude faster than XGBoost and MLP in execution speed. The FOLD-SE algorithm outperforms FOLD-R++ as well as other rule-learning algorithms such as RIPPER in efficiency, performance and scalability, especially for large datasets. A major reason for scalable explainability of FOLD-SE is the use of a literal selection heuristics based on Gini Impurity, as opposed to Information Gain used in FOLD-R++. A multi-category classification version of FOLD-SE is also presented.  ( 2 min )
    Smart Application for Fall Detection Using Wearable ECG & Accelerometer Sensors. (arXiv:2207.00008v2 [cs.HC] UPDATED)
    Timely and reliable detection of falls is a large and rapidly growing field of research due to the medical and financial demand of caring for a constantly growing elderly population. Within the past 2 decades, the availability of high-quality hardware (high-quality sensors and AI microchips) and software (machine learning algorithms) technologies has served as a catalyst for this research by giving developers the capabilities to develop such systems. This study developed multiple application components in order to investigate the development challenges and choices for fall detection systems, and provide materials for future research. The smart application developed using this methodology was validated by the results from fall detection modelling experiments and model mobile deployment. The best performing model overall was the ResNet152 on a standardised, and shuffled dataset with a 2s window size which achieved 92.8% AUC, 87.28% sensitivity, and 98.33% specificity. Given these results it is evident that accelerometer and ECG sensors are beneficial for fall detection, and allow for the discrimination between falls and other activities. This study leaves a significant amount of room for improvement due to weaknesses identified in the resultant dataset. These improvements include using a labelling protocol for the critical phase of a fall, increasing the number of dataset samples, improving the test subject representation, and experimenting with frequency domain preprocessing.  ( 2 min )
    Attribution-based Explanations that Provide Recourse Cannot be Robust. (arXiv:2205.15834v2 [stat.ML] UPDATED)
    Different users of machine learning methods require different explanations, depending on their goals. To make machine learning accountable to society, one important goal is to get actionable options for recourse, which allow an affected user to change the decision $f(x)$ of a machine learning system by making limited changes to its input $x$. We formalize this by providing a general definition of recourse sensitivity, which needs to be instantiated with a utility function that describes which changes to the decisions are relevant to the user. This definition applies to local attribution methods, which attribute an importance weight to each input feature. It is often argued that such local attributions should be robust, in the sense that a small change in the input $x$ that is being explained, should not cause a large change in the feature weights. However, we prove formally that it is in general impossible for any single attribution method to be both recourse sensitive and robust at the same time. It follows that there must always exist counterexamples to at least one of these properties. We provide such counterexamples for several popular attribution methods, including LIME, SHAP, Integrated Gradients and SmoothGrad. Our results also cover counterfactual explanations, which may be viewed as attributions that describe a perturbation of $x$. We further discuss possible ways to work around our impossibility result, for instance by allowing the output to consist of sets with multiple attributions, and we provide sufficient conditions for specific classes of continuous functions to be recourse sensitive. Finally, we strengthen our impossibility result for the restricted case where users are only able to change a single attribute of $x$, by providing an exact characterization of the functions $f$ to which impossibility applies.  ( 2 min )
    Understanding Practices, Challenges, and Opportunities for User-Engaged Algorithm Auditing in Industry Practice. (arXiv:2210.03709v3 [cs.HC] UPDATED)
    Recent years have seen growing interest among both researchers and practitioners in user-engaged approaches to algorithm auditing, which directly engage users in detecting problematic behaviors in algorithmic systems. However, we know little about industry practitioners' current practices and challenges around user-engaged auditing, nor what opportunities exist for them to better leverage such approaches in practice. To investigate, we conducted a series of interviews and iterative co-design activities with practitioners who employ user-engaged auditing approaches in their work. Our findings reveal several challenges practitioners face in appropriately recruiting and incentivizing user auditors, scaffolding user audits, and deriving actionable insights from user-engaged audit reports. Furthermore, practitioners shared organizational obstacles to user-engaged auditing, surfacing a complex relationship between practitioners and user auditors. Based on these findings, we discuss opportunities for future HCI research to help realize the potential (and the mitigate risks) of user-engaged auditing in industry practice.  ( 2 min )
    ELIAS: End-to-End Learning to Index and Search in Large Output Spaces. (arXiv:2210.08410v2 [cs.LG] UPDATED)
    Extreme multi-label classification (XMC) is a popular framework for solving many real-world problems that require accurate prediction from a very large number of potential output choices. A popular approach for dealing with the large label space is to arrange the labels into a shallow tree-based index and then learn an ML model to efficiently search this index via beam search. Existing methods initialize the tree index by clustering the label space into a few mutually exclusive clusters based on pre-defined features and keep it fixed throughout the training procedure. This approach results in a sub-optimal indexing structure over the label space and limits the search performance to the quality of choices made during the initialization of the index. In this paper, we propose a novel method ELIAS which relaxes the tree-based index to a specialized weighted graph-based index which is learned end-to-end with the final task objective. More specifically, ELIAS models the discrete cluster-to-label assignments in the existing tree-based index as soft learnable parameters that are learned jointly with the rest of the ML model. ELIAS achieves state-of-the-art performance on several large-scale extreme classification benchmarks with millions of labels. In particular, ELIAS can be up to 2.5% better at precision@1 and up to 4% better at recall@100 than existing XMC methods. A PyTorch implementation of ELIAS along with other resources is available at https://github.com/nilesh2797/ELIAS.
    Generating Accurate and Faithful Discharge Instructions: Task, Dataset, and Model. (arXiv:2210.12777v2 [cs.CL] UPDATED)
    The "Patient Instruction" (PI), known as "Discharge Instruction", which contains critical instructional information provided both to carers and to the patient at the time of discharge, is essential for the patient to manage their condition outside hospital. An accurate and easy-to-follow PI can improve the self-management of patients which can in turn reduce hospital readmission rates. However, writing an appropriate PI can be extremely time-consuming for physicians, and is subject to being incomplete or error-prone for (potentially overworked) physicians. Therefore, we propose a new task that can provide an objective means of avoiding incompleteness, while reducing clinical workload: the automatic generation of the PI, which is imagined as being a document that the clinician can review, modify, and approve as necessary (rather than taking the human "out of the loop"). We build a benchmark clinical dataset and propose the Re3Writer, which imitates the working patterns of physicians to first retrieve related working experience from historical PIs written by physicians, then reason related medical knowledge. Finally, it refines the retrieved working experience and reasoned medical knowledge to extract useful information, which is used to generate the PI for previously-unseen patient according to their health records during hospitalization. Our experiments show that, using our method, the performance of five different models can be substantially boosted across all metrics, with up to 20%, 11%, and 19% relative improvements in BLEU-4, ROUGE-L, and METEOR, respectively. Meanwhile, we show results from human evaluations to measure the effectiveness in terms of its usefulness for clinical practice. The code is available at https://github.com/AI-in-Hospitals/Patient-Instructions
    Bias-Aware Face Mask Detection Dataset. (arXiv:2211.01207v3 [cs.CV] UPDATED)
    In December 2019, a novel coronavirus (COVID-19) spread so quickly around the world that many countries had to set mandatory face mask rules in public areas to reduce the transmission of the virus. To monitor public adherence, researchers aimed to rapidly develop efficient systems that can detect faces with masks automatically. However, the lack of representative and novel datasets proved to be the biggest challenge. Early attempts to collect face mask datasets did not account for potential race, gender, and age biases. Therefore, the resulting models show inherent biases toward specific race groups, such as Asian or Caucasian. In this work, we present a novel face mask detection dataset that contains images posted on Twitter during the pandemic from around the world. Unlike previous datasets, the proposed Bias-Aware Face Mask Detection (BAFMD) dataset contains more images from underrepresented race and age groups to mitigate the problem for the face mask detection task. We perform experiments to investigate potential biases in widely used face mask detection datasets and illustrate that the BAFMD dataset yields models with better performance and generalization ability. The dataset is publicly available at https://github.com/Alpkant/BAFMD.
    Stars: Tera-Scale Graph Building for Clustering and Graph Learning. (arXiv:2212.02635v2 [cs.LG] UPDATED)
    A fundamental procedure in the analysis of massive datasets is the construction of similarity graphs. Such graphs play a key role for many downstream tasks, including clustering, classification, graph learning, and nearest neighbor search. For these tasks, it is critical to build graphs which are sparse yet still representative of the underlying data. The benefits of sparsity are twofold: firstly, constructing dense graphs is infeasible in practice for large datasets, and secondly, the runtime of downstream tasks is directly influenced by the sparsity of the similarity graph. In this work, we present $\textit{Stars}$: a highly scalable method for building extremely sparse graphs via two-hop spanners, which are graphs where similar points are connected by a path of length at most two. Stars can construct two-hop spanners with significantly fewer similarity comparisons, which are a major bottleneck for learning based models where comparisons are expensive to evaluate. Theoretically, we demonstrate that Stars builds a graph in nearly-linear time, where approximate nearest neighbors are contained within two-hop neighborhoods. In practice, we have deployed Stars for multiple data sets allowing for graph building at the $\textit{Tera-Scale}$, i.e., for graphs with tens of trillions of edges. We evaluate the performance of Stars for clustering and graph learning, and demonstrate 10~1000-fold improvements in pairwise similarity comparisons compared to different baselines, and 2~10-fold improvement in running time without quality loss.
    Dynamic Tensor Product Regression. (arXiv:2210.03961v2 [cs.DS] UPDATED)
    In this work, we initiate the study of \emph{Dynamic Tensor Product Regression}. One has matrices $A_1\in \mathbb{R}^{n_1\times d_1},\ldots,A_q\in \mathbb{R}^{n_q\times d_q}$ and a label vector $b\in \mathbb{R}^{n_1\ldots n_q}$, and the goal is to solve the regression problem with the design matrix $A$ being the tensor product of the matrices $A_1, A_2, \dots, A_q$ i.e. $\min_{x\in \mathbb{R}^{d_1\ldots d_q}}~\|(A_1\otimes \ldots\otimes A_q)x-b\|_2$. At each time step, one matrix $A_i$ receives a sparse change, and the goal is to maintain a sketch of the tensor product $A_1\otimes\ldots \otimes A_q$ so that the regression solution can be updated quickly. Recomputing the solution from scratch for each round is very slow and so it is important to develop algorithms which can quickly update the solution with the new design matrix. Our main result is a dynamic tree data structure where any update to a single matrix can be propagated quickly throughout the tree. We show that our data structure can be used to solve dynamic versions of not only Tensor Product Regression, but also Tensor Product Spline regression (which is a generalization of ridge regression) and for maintaining Low Rank Approximations for the tensor product.
    How Far Should We Look Back to Achieve Effective Real-Time Time-Series Anomaly Detection?. (arXiv:2102.06560v6 [cs.LG] UPDATED)
    Anomaly detection is the process of identifying unexpected events or ab-normalities in data, and it has been applied in many different areas such as system monitoring, fraud detection, healthcare, intrusion detection, etc. Providing real-time, lightweight, and proactive anomaly detection for time series with neither human intervention nor domain knowledge could be highly valuable since it reduces human effort and enables appropriate countermeasures to be undertaken before a disastrous event occurs. To our knowledge, RePAD (Real-time Proactive Anomaly Detection algorithm) is a generic approach with all above-mentioned features. To achieve real-time and lightweight detection, RePAD utilizes Long Short-Term Memory (LSTM) to detect whether or not each upcoming data point is anomalous based on short-term historical data points. However, it is unclear that how different amounts of historical data points affect the performance of RePAD. Therefore, in this paper, we investigate the impact of different amounts of historical data on RePAD by introducing a set of performance metrics that cover novel detection accuracy measures, time efficiency, readiness, and resource consumption, etc. Empirical experiments based on real-world time series datasets are conducted to evaluate RePAD in different scenarios, and the experimental results are presented and discussed.
    Improving Scheduled Sampling with Elastic Weight Consolidation for Neural Machine Translation. (arXiv:2109.06308v3 [cs.CL] UPDATED)
    Despite strong performance in many sequence-to-sequence tasks, autoregressive models trained with maximum likelihood estimation suffer from exposure bias, i.e. the discrepancy between the ground-truth prefixes used during training and the model-generated prefixes used at inference time. Scheduled sampling is a simple and empirically successful approach which addresses this issue by incorporating model-generated prefixes into training. However, it has been argued that it is an inconsistent training objective leading to models ignoring the prefixes altogether. In this paper, we conduct systematic experiments and find that scheduled sampling, while it ameliorates exposure bias by increasing model reliance on the input sequence, worsens performance when the prefix at inference time is correct, a form of catastrophic forgetting. We propose to use Elastic Weight Consolidation to better balance mitigating exposure bias with retaining performance. Experiments on four IWSLT'14 and WMT'14 translation datasets demonstrate that our approach alleviates catastrophic forgetting and significantly outperforms maximum likelihood estimation and scheduled sampling baselines.
    ProxyBO: Accelerating Neural Architecture Search via Bayesian Optimization with Zero-cost Proxies. (arXiv:2110.10423v3 [cs.LG] UPDATED)
    Designing neural architectures requires immense manual efforts. This has promoted the development of neural architecture search (NAS) to automate the design. While previous NAS methods achieve promising results but run slowly, zero-cost proxies run extremely fast but are less promising. Therefore, it is of great potential to accelerate NAS via those zero-cost proxies. The existing method has two limitations, which are unforeseeable reliability and one-shot usage. To address the limitations, we present ProxyBO, an efficient Bayesian optimization (BO) framework that utilizes the zero-cost proxies to accelerate neural architecture search. We apply the generalization ability measurement to estimate the fitness of proxies on the task during each iteration and design a novel acquisition function to combine BO with zero-cost proxies based on their dynamic influence. Extensive empirical studies show that ProxyBO consistently outperforms competitive baselines on five tasks from three public benchmarks. Concretely, ProxyBO achieves up to 5.41x and 3.86x speedups over the state-of-the-art approaches REA and BRP-NAS.
    Adaptive Data Debiasing through Bounded Exploration. (arXiv:2110.13054v2 [cs.LG] UPDATED)
    Biases in existing datasets used to train algorithmic decision rules can raise ethical and economic concerns due to the resulting disparate treatment of different groups. We propose an algorithm for sequentially debiasing such datasets through adaptive and bounded exploration in a classification problem with costly and censored feedback. Exploration in this context means that at times, and to a judiciously-chosen extent, the decision maker deviates from its (current) loss-minimizing rule, and instead accepts some individuals that would otherwise be rejected, so as to reduce statistical data biases. Our proposed algorithm includes parameters that can be used to balance between the ultimate goal of removing data biases -- which will in turn lead to more accurate and fair decisions, and the exploration risks incurred to achieve this goal. We analytically show that such exploration can help debias data in certain distributions. We further investigate how fairness criteria can work in conjunction with our data debiasing algorithm. We illustrate the performance of our algorithm using experiments on synthetic and real-world datasets.
    HierarchicalForecast: A Reference Framework for Hierarchical Forecasting in Python. (arXiv:2207.03517v4 [stat.ML] UPDATED)
    Large collections of time series data are commonly organized into structures with different levels of aggregation; examples include product and geographical groupings. It is often important to ensure that the forecasts are coherent so that the predicted values at disaggregate levels add up to the aggregate forecast. The growing interest of the Machine Learning community in hierarchical forecasting systems indicates that we are in a propitious moment to ensure that scientific endeavors are grounded on sound baselines. For this reason, we put forward the HierarchicalForecast library, which contains preprocessed publicly available datasets, evaluation metrics, and a compiled set of statistical baseline models. Our Python-based reference framework aims to bridge the gap between statistical and econometric modeling, and Machine Learning forecasting research. Code and documentation are available in https://github.com/Nixtla/hierarchicalforecast.
    AniWho : A Quick and Accurate Way to Classify Anime Character Faces in Images. (arXiv:2208.11012v3 [cs.CV] UPDATED)
    In order to classify Japanese animation-style character faces, this paper attempts to delve further into the many models currently available, including InceptionV3, InceptionResNetV2, MobileNetV2, and EfficientNet, employing transfer learning. This paper demonstrates that EfficientNet-B7, which achieves a top-1 accuracy of 85.08%, has the highest accuracy rate. MobileNetV2, which achieves a less accurate result with a top-1 accuracy of 81.92%, benefits from a significantly faster inference time and fewer required parameters. However, from the experiment, MobileNet-V2 is prone to overfitting; EfficienNet-B0 fixed the overfitting issue but with a cost of a little slower in inference time than MobileNet-V2 but a little more accurate result, top-1 accuracy of 83.46%. This paper also uses a few-shot learning architecture called Prototypical Networks, which offers an adequate substitute for conventional transfer learning techniques.
    MixGen: A New Multi-Modal Data Augmentation. (arXiv:2206.08358v3 [cs.CV] UPDATED)
    Data augmentation is a necessity to enhance data efficiency in deep learning. For vision-language pre-training, data is only augmented either for images or for text in previous works. In this paper, we present MixGen: a joint data augmentation for vision-language representation learning to further improve data efficiency. It generates new image-text pairs with semantic relationships preserved by interpolating images and concatenating text. It's simple, and can be plug-and-played into existing pipelines. We evaluate MixGen on four architectures, including CLIP, ViLT, ALBEF and TCL, across five downstream vision-language tasks to show its versatility and effectiveness. For example, adding MixGen in ALBEF pre-training leads to absolute performance improvements on downstream tasks: image-text retrieval (+6.2% on COCO fine-tuned and +5.3% on Flicker30K zero-shot), visual grounding (+0.9% on RefCOCO+), visual reasoning (+$0.9% on NLVR2), visual question answering (+0.3% on VQA2.0), and visual entailment (+0.4% on SNLI-VE).
    Toward a `Standard Model' of Machine Learning. (arXiv:2108.07783v2 [cs.LG] UPDATED)
    Machine learning (ML) is about computational methods that enable machines to learn concepts from experience. In handling a wide variety of experience ranging from data instances, knowledge, constraints, to rewards, adversaries, and lifelong interaction in an ever-growing spectrum of tasks, contemporary ML/AI (artificial intelligence) research has resulted in a multitude of learning paradigms and methodologies. Despite the continual progresses on all different fronts, the disparate narrowly focused methods also make standardized, composable, and reusable development of ML approaches difficult, and preclude the opportunity to build AI agents that panoramically learn from all types of experience. This article presents a standardized ML formalism, in particular a `standard equation' of the learning objective, that offers a unifying understanding of many important ML algorithms in the supervised, unsupervised, knowledge-constrained, reinforcement, adversarial, and online learning paradigms, respectively -- those diverse algorithms are encompassed as special cases due to different choices of modeling components. The framework also provides guidance for mechanical design of new ML approaches and serves as a promising vehicle toward panoramic machine learning with all experience.
    Discriminator-Guided Model-Based Offline Imitation Learning. (arXiv:2207.00244v3 [cs.LG] UPDATED)
    Offline imitation learning (IL) is a powerful method to solve decision-making problems from expert demonstrations without reward labels. Existing offline IL methods suffer from severe performance degeneration under limited expert data. Including a learned dynamics model can potentially improve the state-action space coverage of expert data, however, it also faces challenging issues like model approximation/generalization errors and suboptimality of rollout data. In this paper, we propose the Discriminator-guided Model-based offline Imitation Learning (DMIL) framework, which introduces a discriminator to simultaneously distinguish the dynamics correctness and suboptimality of model rollout data against real expert demonstrations. DMIL adopts a novel cooperative-yet-adversarial learning strategy, which uses the discriminator to guide and couple the learning process of the policy and dynamics model, resulting in improved model performance and robustness. Our framework can also be extended to the case when demonstrations contain a large proportion of suboptimal data. Experimental results show that DMIL and its extension achieve superior performance and robustness compared to state-of-the-art offline IL methods under small datasets.
    A Decomposition-Based Hybrid Ensemble CNN Framework for Driver Fatigue Recognition. (arXiv:2203.09477v2 [eess.SP] UPDATED)
    Electroencephalogram (EEG) has become increasingly popular in driver fatigue monitoring systems. Several decomposition methods have been attempted to analyze the EEG signals that are complex, nonlinear and non-stationary and improve the EEG decoding performance in different applications. However, it remains challenging to extract more distinguishable features from different decomposed components for driver fatigue recognition. In this work, we propose a novel decomposition-based hybrid ensemble convolutional neural network (CNN) framework to enhance the capability of decoding EEG signals. Four decomposition methods are employed to disassemble the EEG signals into components of different complexity. Instead of handcraft features, the CNNs in this framework directly learn from the decomposed components. In addition, a component-specific batch normalization layer is employed to reduce subject variability. Moreover, we employ two ensemble modes to integrate the outputs of all CNNs, comprehensively exploiting the diverse information of the decomposed components. Against the challenging cross-subject driver fatigue recognition task, the models under the framework all showed superior performance to the strong baselines. Specifically, the performance of different decomposition methods and ensemble modes was further compared. The results indicated that discrete wavelet transform-based ensemble CNN achieved the highest average classification accuracy of 83.48% among the compared methods. The proposed framework can be extended to any CNN architecture and be applied to any EEG-related tasks, opening the possibility of extracting more beneficial features from complex EEG data.
    Value Cards: An Educational Toolkit for Teaching Social Impacts of Machine Learning through Deliberation. (arXiv:2010.11411v3 [cs.CY] UPDATED)
    Recently, there have been increasing calls for computer science curricula to complement existing technical training with topics related to Fairness, Accountability, Transparency, and Ethics. In this paper, we present Value Card, an educational toolkit to inform students and practitioners of the social impacts of different machine learning models via deliberation. This paper presents an early use of our approach in a college-level computer science course. Through an in-class activity, we report empirical data for the initial effectiveness of our approach. Our results suggest that the use of the Value Cards toolkit can improve students' understanding of both the technical definitions and trade-offs of performance metrics and apply them in real-world contexts, help them recognize the significance of considering diverse social values in the development of deployment of algorithmic systems, and enable them to communicate, negotiate and synthesize the perspectives of diverse stakeholders. Our study also demonstrates a number of caveats we need to consider when using the different variants of the Value Cards toolkit. Finally, we discuss the challenges as well as future applications of our approach.
    Convergence of Deep ReLU Networks. (arXiv:2107.12530v3 [cs.LG] UPDATED)
    We explore convergence of deep neural networks with the popular ReLU activation function, as the depth of the networks tends to infinity. To this end, we introduce the notion of activation domains and activation matrices of a ReLU network. By replacing applications of the ReLU activation function by multiplications with activation matrices on activation domains, we obtain an explicit expression of the ReLU network. We then identify the convergence of the ReLU networks as convergence of a class of infinite products of matrices. Sufficient and necessary conditions for convergence of these infinite products of matrices are studied. As a result, we establish necessary conditions for ReLU networks to converge that the sequence of weight matrices converges to the identity matrix and the sequence of the bias vectors converges to zero as the depth of ReLU networks increases to infinity. Moreover, we obtain sufficient conditions in terms of the weight matrices and bias vectors at hidden layers for pointwise convergence of deep ReLU networks. These results provide mathematical insights to the design strategy of the well-known deep residual networks in image classification.
    Reconstructing Sparse Multiplex Networks with Application to Covert Networks. (arXiv:2208.01739v3 [cs.SI] UPDATED)
    Network structure provides critical information for understanding the dynamic behavior of networks. However, the complete structure of real-world networks is often unavailable, thus it is crucially important to develop approaches to infer a more complete structure of networks. In this paper, we integrate the configuration model for generating random networks into an Expectation-Maximization-Aggregation (EMA) framework to reconstruct the complete structure of multiplex networks. We validate the proposed EMA framework against the random model on several real-world multiplex networks, including both covert and overt ones. It is found that the EMA framework generally achieves the best predictive accuracy compared to the EM framework and the random model. As the number of layers increases, the performance improvement of EMA over EM decreases. The inferred multiplex networks can be leveraged to inform the decision-making on monitoring covert networks as well as allocating limited resources for collecting additional information to improve reconstruction accuracy. For law enforcement agencies, the inferred complete network structure can be used to develop more effective strategies for covert network interdiction.
    Predictive Process Model Monitoring using Recurrent Neural Networks. (arXiv:2011.02819v3 [cs.LG] UPDATED)
    The field of predictive process monitoring focuses on case-level models to predict a single specific outcome such as a particular objective, (remaining) time, or next activity/remaining sequence. Recently, a longer-horizon, model-wide approach has been proposed in the form of process model forecasting, which predicts the future state of a whole process model through the forecasting of all activity-to-activity relations at once using time series forecasting. This paper introduces the concept of \emph{predictive process model monitoring} which sits in the middle of both predictive process monitoring and process model forecasting. Concretely, by modelling a process model as a set of constraints being present between activities over time, we can capture more detailed information between activities compared to process model forecasting, while being compatible with typical predictive process monitoring objectives which are often expressed in the same language as these constraints. To achieve this, Processes-As-Movies (PAM) is introduced, i.e., a novel technique capable of jointly mining and predicting declarative process constraints between activities in various windows of a process' execution. PAM predicts what declarative rules hold for a trace (objective-based), which also supports the prediction of all constraints together as a process model (model-based). Various recurrent neural network topologies inspired by video analysis tailored to temporal high-dimensional input are used to model the process model evolution with windows as time steps, including encoder-decoder long short-term memory networks, and convolutional long short-term memory networks. Results obtained over real-life event logs show that these topologies are effective in terms of predictive accuracy and precision.
    Exploring How Machine Learning Practitioners (Try To) Use Fairness Toolkits. (arXiv:2205.06922v2 [cs.HC] UPDATED)
    Recent years have seen the development of many open-source ML fairness toolkits aimed at helping ML practitioners assess and address unfairness in their systems. However, there has been little research investigating how ML practitioners actually use these toolkits in practice. In this paper, we conducted the first in-depth empirical exploration of how industry practitioners (try to) work with existing fairness toolkits. In particular, we conducted think-aloud interviews to understand how participants learn about and use fairness toolkits, and explored the generality of our findings through an anonymous online survey. We identified several opportunities for fairness toolkits to better address practitioner needs and scaffold them in using toolkits effectively and responsibly. Based on these findings, we highlight implications for the design of future open-source fairness toolkits that can support practitioners in better contextualizing, communicating, and collaborating around ML fairness efforts.
    Sampling random graph homomorphisms and applications to network data analysis. (arXiv:1910.09483v3 [math.PR] UPDATED)
    A graph homomorphism is a map between two graphs that preserves adjacency relations. We consider the problem of sampling a random graph homomorphism from a graph into a large network. We propose two complementary MCMC algorithms for sampling random graph homomorphisms and establish bounds on their mixing times and the concentration of their time averages. Based on our sampling algorithms, we propose a novel framework for network data analysis that circumvents some of the drawbacks in methods based on independent and neighborhood sampling. Various time averages of the MCMC trajectory give us various computable observables, including well-known ones such as homomorphism density and average clustering coefficient and their generalizations. Furthermore, we show that these network observables are stable with respect to a suitably renormalized cut distance between networks. We provide various examples and simulations demonstrating our framework through synthetic networks. We also \commHL{demonstrate the performance of} our framework on the tasks of network clustering and subgraph classification on the Facebook100 dataset and on Word Adjacency Networks of a set of classic novels.
    Calibrated simplex-mapping classification. (arXiv:2103.02926v2 [stat.ML] UPDATED)
    We propose a novel methodology for general multi-class classification in arbitrary feature spaces, which results in a potentially well-calibrated classifier. Calibrated classifiers are important in many applications because, in addition to the prediction of mere class labels, they also yield a confidence level for each of their predictions. In essence, the training of our classifier proceeds in two steps. In a first step, the training data is represented in a latent space whose geometry is induced by a regular $(n-1)$-dimensional simplex, $n$ being the number of classes. We design this representation in such a way that it well reflects the feature space distances of the datapoints to their own- and foreign-class neighbors. In a second step, the latent space representation of the training data is extended to the whole feature space by fitting a regression model to the transformed data. With this latent-space representation, our calibrated classifier is readily defined. We rigorously establish its core theoretical properties and benchmark its prediction and calibration properties by means of various synthetic and real-world data sets from different application domains.
    Towards Understanding Quality Challenges of the Federated Learning for Neural Networks: A First Look from the Lens of Robustness. (arXiv:2201.01409v2 [cs.LG] UPDATED)
    Federated learning (FL) is a distributed learning paradigm that preserves users' data privacy while leveraging the entire dataset of all participants. In FL, multiple models are trained independently on the clients and aggregated centrally to update a global model in an iterative process. Although this approach is excellent at preserving privacy, FL still suffers from quality issues such as attacks or byzantine faults. Recent attempts have been made to address such quality challenges on the robust aggregation techniques for FL. However, the effectiveness of state-of-the-art (SOTA) robust FL techniques is still unclear and lacks a comprehensive study. Therefore, to better understand the current quality status and challenges of these SOTA FL techniques in the presence of attacks and faults, we perform a large-scale empirical study to investigate the SOTA FL's quality from multiple angles of attacks, simulated faults (via mutation operators), and aggregation (defense) methods. In particular, we study FL's performance on the image classification tasks and use DNNs as our model type. Furthermore, we perform our study on two generic image datasets and one real-world federated medical image dataset. We also investigate the effect of the proportion of affected clients and the dataset distribution factors on the robustness of FL. After a large-scale analysis with 496 configurations, we find that most mutators on each user have a negligible effect on the final model in the generic datasets, and only one of them is effective in the medical dataset. Furthermore, we show that model poisoning attacks are more effective than data poisoning attacks. Moreover, choosing the most robust FL aggregator depends on the attacks and datasets. Finally, we illustrate that a simple ensemble of aggregators achieves a more robust solution than any single aggregator and is the best choice in 75% of the cases.
    BASPRO: a balanced script producer for speech corpus collection based on the genetic algorithm. (arXiv:2301.04120v1 [cs.NE])
    The performance of speech-processing models is heavily influenced by the speech corpus that is used for training and evaluation. In this study, we propose BAlanced Script PROducer (BASPRO) system, which can automatically construct a phonetically balanced and rich set of Chinese sentences for collecting Mandarin Chinese speech data. First, we used pretrained natural language processing systems to extract ten-character candidate sentences from a large corpus of Chinese news texts. Then, we applied a genetic algorithm-based method to select 20 phonetically balanced sentence sets, each containing 20 sentences, from the candidate sentences. Using BASPRO, we obtained a recording script called TMNews, which contains 400 ten-character sentences. TMNews covers 84% of the syllables used in the real world. Moreover, the syllable distribution has 0.96 cosine similarity to the real-world syllable distribution. We converted the script into a speech corpus using two text-to-speech systems. Using the designed speech corpus, we tested the performances of speech enhancement (SE) and automatic speech recognition (ASR), which are one of the most important regression- and classification-based speech processing tasks, respectively. The experimental results show that the SE and ASR models trained on the designed speech corpus outperform their counterparts trained on a randomly composed speech corpus.  ( 2 min )
    Towards AI-controlled FES-restoration of arm movements: neuromechanics-based reinforcement learning for 3-D reaching. (arXiv:2301.04004v1 [eess.SY])
    Reaching disabilities affect the quality of life. Functional Electrical Stimulation (FES) can restore lost motor functions. Yet, there remain challenges in controlling FES to induce desired movements. Neuromechanical models are valuable tools for developing FES control methods. However, focusing on the upper extremity areas, several existing models are either overly simplified or too computationally demanding for control purposes. Besides the model-related issues, finding a general method for governing the control rules for different tasks and subjects remains an engineering challenge. Here, we present our approach toward FES-based restoration of arm movements to address those fundamental issues in controlling FES. Firstly, we present our surface-FES-oriented neuromechanical models of human arms built using well-accepted, open-source software. The models are designed to capture significant dynamics in FES controls with minimal computational cost. Our models are customisable and can be used for testing different control methods. Secondly, we present the application of reinforcement learning (RL) as a general method for governing the control rules. In combination, our customisable models and RL-based control method open the possibility of delivering customised FES controls for different subjects and settings with minimal engineering intervention. We demonstrate our approach in planar and 3D settings.  ( 2 min )
    A Dietary Nutrition-aided Healthcare Platform via Effective Food Recognition on a Localized Singaporean Food Dataset. (arXiv:2301.03829v1 [cs.LG])
    Localized food datasets have profound meaning in revealing a country's special cuisines to explore people's dietary behaviors, which will shed light on their health conditions and disease development. In this paper, revolving around the demand for accurate food recognition in Singapore, we develop the FoodSG platform to incubate diverse healthcare-oriented applications as a service in Singapore, taking into account their shared requirements. We release a localized Singaporean food dataset FoodSG-233 with a systematic cleaning and curation pipeline for promoting future data management research in food computing. To overcome the hurdle in recognition performance brought by Singaporean multifarious food dishes, we propose to integrate supervised contrastive learning into our food recognition model FoodSG-SCL for the intrinsic capability to mine hard positive/negative samples and therefore boost the accuracy. Through a comprehensive evaluation, we share the insightful experience with practitioners in the data management community regarding food-related data-intensive healthcare applications. The FoodSG-233 dataset can be accessed via: https://foodlg.comp.nus.edu.sg/.  ( 2 min )
    Imbalanced Classification In Faulty Turbine Data: New Proximal Policy Optimization. (arXiv:2301.04049v1 [eess.SY])
    There is growing importance to detecting faults and implementing the best methods in industrial and real-world systems. We are searching for the most trustworthy and practical data-based fault detection methods proposed by artificial intelligence applications. In this paper, we propose a framework for fault detection based on reinforcement learning and a policy known as proximal policy optimization. As a result of the lack of fault data, one of the significant problems with the traditional policy is its weakness in detecting fault classes, which was addressed by changing the cost function. Using modified Proximal Policy Optimization, we can increase performance, overcome data imbalance, and better predict future faults. When our modified policy is implemented, all evaluation metrics will increase by $3\%$ to $4\%$ as compared to the traditional policy in the first benchmark, between $20\%$ and $55\%$ in the second benchmark, and between $6\%$ and $14\%$ in the third benchmark, as well as an improvement in performance and prediction speed compared to previous methods.  ( 2 min )
    There is No Big Brother or Small Brother: Knowledge Infusion in Language Models for Link Prediction and Question Answering. (arXiv:2301.04013v1 [cs.CL])
    The integration of knowledge graphs with deep learning is thriving in improving the performance of various natural language processing (NLP) tasks. In this paper, we focus on knowledge-infused link prediction and question answering using language models, T5, and BLOOM across three domains: Aviation, Movie, and Web. In this context, we infuse knowledge in large and small language models and study their performance, and find the performance to be similar. For the link prediction task on the Aviation Knowledge Graph, we obtain a 0.2 hits@1 score using T5-small, T5-base, T5-large, and BLOOM. Using template-based scripts, we create a set of 1 million synthetic factoid QA pairs in the aviation domain from National Transportation Safety Board (NTSB) reports. On our curated QA pairs, the three models of T5 achieve a 0.7 hits@1 score. We validate out findings with the paired student t-test and Cohen's kappa scores. For link prediction on Aviation Knowledge Graph using T5-small and T5-large, we obtain a Cohen's kappa score of 0.76, showing substantial agreement between the models. Thus, we infer that small language models perform similar to large language models with the infusion of knowledge.  ( 2 min )
    Manifold Restricted Interventional Shapley Values. (arXiv:2301.04041v1 [stat.ML])
    Shapley values are model-agnostic methods for explaining model predictions. Many commonly used methods of computing Shapley values, known as \emph{off-manifold methods}, rely on model evaluations on out-of-distribution input samples. Consequently, explanations obtained are sensitive to model behaviour outside the data distribution, which may be irrelevant for all practical purposes. While \emph{on-manifold methods} have been proposed which do not suffer from this problem, we show that such methods are overly dependent on the input data distribution, and therefore result in unintuitive and misleading explanations. To circumvent these problems, we propose \emph{ManifoldShap}, which respects the model's domain of validity by restricting model evaluations to the data manifold. We show, theoretically and empirically, that ManifoldShap is robust to off-manifold perturbations of the model and leads to more accurate and intuitive explanations than existing state-of-the-art Shapley methods.  ( 2 min )
    Sentiment-based Engagement Strategies for intuitive Human-Robot Interaction. (arXiv:2301.03867v1 [cs.RO])
    Emotion expressions serve as important communicative signals and are crucial cues in intuitive interactions between humans. Hence, it is essential to include these fundamentals in robotic behavior strategies when interacting with humans to promote mutual understanding and to reduce misjudgements. We tackle this challenge by detecting and using the emotional state and attention for a sentiment analysis of potential human interaction partners to select well-adjusted engagement strategies. This way, we pave the way for more intuitive human-robot interactions, as the robot's action conforms to the person's mood and expectation. We propose four different engagement strategies with implicit and explicit communication techniques that we implement on a mobile robot platform for initial experiments.  ( 2 min )
    Quantifying Assurance in Learning-enabled Systems. (arXiv:2006.10345v1 [cs.SE] CROSS LISTED)
    Dependability assurance of systems embedding machine learning(ML) components---so called learning-enabled systems (LESs)---is a key step for their use in safety-critical applications. In emerging standardization and guidance efforts, there is a growing consensus in the value of using assurance cases for that purpose. This paper develops a quantitative notion of assurance that an LES is dependable, as a core component of its assurance case, also extending our prior work that applied to ML components. Specifically, we characterize LES assurance in the form of assurance measures: a probabilistic quantification of confidence that an LES possesses system-level properties associated with functional capabilities and dependability attributes. We illustrate the utility of assurance measures by application to a real world autonomous aviation system, also describing their role both in i) guiding high-level, runtime risk mitigation decisions and ii) as a core component of the associated dynamic assurance case.  ( 2 min )
    Sharing pattern submodels for prediction with missing values. (arXiv:2206.11161v2 [cs.LG] UPDATED)
    Missing values are unavoidable in many applications of machine learning and present challenges both during training and at test time. When variables are missing in recurring patterns, fitting separate pattern submodels have been proposed as a solution. However, fitting models independently does not make efficient use of all available data. Conversely, fitting a single shared model to the full data set relies on imputation which often leads to biased results when missingness depends on unobserved factors. We propose an alternative approach, called sharing pattern submodels, which i) makes predictions that are robust to missing values at test time, ii) maintains or improves the predictive power of pattern submodels, and iii) has a short description, enabling improved interpretability. Parameter sharing is enforced through sparsity-inducing regularization which we prove leads to consistent estimation. Finally, we give conditions for when a sharing model is optimal, even when both missingness and the target outcome depend on unobserved variables. Classification and regression experiments on synthetic and real-world data sets demonstrate that our models achieve a favorable tradeoff between pattern specialization and information sharing.  ( 2 min )
    Actor-Director-Critic: A Novel Deep Reinforcement Learning Framework. (arXiv:2301.03887v1 [cs.LG])
    In this paper, we propose actor-director-critic, a new framework for deep reinforcement learning. Compared with the actor-critic framework, the director role is added, and action classification and action evaluation are applied simultaneously to improve the decision-making performance of the agent. Firstly, the actions of the agent are divided into high quality actions and low quality actions according to the rewards returned from the environment. Then, the director network is trained to have the ability to discriminate high and low quality actions and guide the actor network to reduce the repetitive exploration of low quality actions in the early stage of training. In addition, we propose an improved double estimator method to better solve the problem of overestimation in the field of reinforcement learning. For the two critic networks used, we design two target critic networks for each critic network instead of one. In this way, the target value of each critic network can be calculated by taking the average of the outputs of the two target critic networks, which is more stable and accurate than using only one target critic network to obtain the target value. In order to verify the performance of the actor-director-critic framework and the improved double estimator method, we applied them to the TD3 algorithm to improve the TD3 algorithm. Then, we carried out experiments in multiple environments in MuJoCo and compared the experimental data before and after the algorithm improvement. The final experimental results show that the improved algorithm can achieve faster convergence speed and higher total return.  ( 2 min )
    RedMule: A Mixed-Precision Matrix-Matrix Operation Engine for Flexible and Energy-Efficient On-Chip Linear Algebra and TinyML Training Acceleration. (arXiv:2301.03904v1 [cs.AR])
    The increasing interest in TinyML, i.e., near-sensor machine learning on power budgets of a few tens of mW, is currently pushing toward enabling TinyML-class training as opposed to inference only. Current training algorithms, based on various forms of error and gradient backpropagation, rely on floating-point matrix operations to meet the precision and dynamic range requirements. So far, the energy and power cost of these operations has been considered too high for TinyML scenarios. This paper addresses the open challenge of near-sensor training on a few mW power budget and presents RedMulE - Reduced-Precision Matrix Multiplication Engine, a low-power specialized accelerator conceived for multi-precision floating-point General Matrix-Matrix Operations (GEMM-Ops) acceleration, supporting FP16, as well as hybrid FP8 formats, with {sign, exponent, mantissa}=({1,4,3}, {1,5,2}). We integrate RedMule into a Parallel Ultra-Low-Power (PULP) cluster containing eight energy-efficient RISC-V cores sharing a tightly-coupled data memory and implement the resulting system in a 22 nm technology. At its best efficiency point (@ 470 MHz, 0.65 V), the RedMulE-augmented PULP cluster achieves 755 GFLOPS/W and 920 GFLOPS/W during regular General Matrix-Matrix Multiplication (GEMM), and up to 1.19 TFLOPS/W and 1.67 TFLOPS/W when executing GEMM-Ops, respectively, for FP16 and FP8 input/output tensors. In its best performance point (@ 613 MHz, 0.8 V), RedMulE achieves up to 58.5 GFLOPS and 117 GFLOPS for FP16 and FP8, respectively, with 99.4% utilization of the array of Computing Elements and consuming less than 60 mW on average, thus enabling on-device training of deep learning models in TinyML application scenarios while retaining the flexibility to tackle other classes of common linear algebra problems efficiently.  ( 2 min )
    Neighborhood-Regularized Self-Training for Learning with Few Labels. (arXiv:2301.03726v1 [cs.LG])
    Training deep neural networks (DNNs) with limited supervision has been a popular research topic as it can significantly alleviate the annotation burden. Self-training has been successfully applied in semi-supervised learning tasks, but one drawback of self-training is that it is vulnerable to the label noise from incorrect pseudo labels. Inspired by the fact that samples with similar labels tend to share similar representations, we develop a neighborhood-based sample selection approach to tackle the issue of noisy pseudo labels. We further stabilize self-training via aggregating the predictions from different rounds during sample selection. Experiments on eight tasks show that our proposed method outperforms the strongest self-training baseline with 1.83% and 2.51% performance gain for text and graph datasets on average. Our further analysis demonstrates that our proposed data selection strategy reduces the noise of pseudo labels by 36.8% and saves 57.3% of the time when compared with the best baseline. Our code and appendices will be uploaded to https://github.com/ritaranx/NeST.  ( 2 min )
    Min-Max Optimization Made Simple: Approximating the Proximal Point Method via Contraction Maps. (arXiv:2301.03931v1 [cs.GT])
    In this paper we present a first-order method that admits near-optimal convergence rates for convex/concave min-max problems while requiring a simple and intuitive analysis. Similarly to the seminal work of Nemirovski and the recent approach of Piliouras et al. in normal form games, our work is based on the fact that the update rule of the Proximal Point method (PP) can be approximated up to accuracy $\epsilon$ with only $\mathcal{O}(\log 1/\epsilon)$ additional gradient-calls through the iterations of a contraction map. Then combining the analysis of (PP) method with an error-propagation analysis we establish that the resulting first order method, called \textit{Clairvoyant Extra Gradient}, admits near-optimal time-average convergence for general domains and last-iterate convergence in the unconstrained case.  ( 2 min )
    Is Federated Learning a Practical PET Yet?. (arXiv:2301.04017v1 [cs.CR])
    Federated learning (FL) is a framework for users to jointly train a machine learning model. FL is promoted as a privacy-enhancing technology (PET) that provides data minimization: data never "leaves" personal devices and users share only model updates with a server (e.g., a company) coordinating the distributed training. We assess the realistic (i.e., worst-case) privacy guarantees that are provided to users who are unable to trust the server. To this end, we propose an attack against FL protected with distributed differential privacy (DDP) and secure aggregation (SA). The attack method is based on the introduction of Sybil devices that deviate from the protocol to expose individual users' data for reconstruction by the server. The underlying root cause for the vulnerability to our attack is the power imbalance. The server orchestrates the whole protocol and users are given little guarantees about the selection of other users participating in the protocol. Moving forward, we discuss requirements for an FL protocol to guarantee DDP without asking users to trust the server. We conclude that such systems are not yet practical.  ( 2 min )
    On adversarial robustness and the use of Wasserstein ascent-descent dynamics to enforce it. (arXiv:2301.03662v1 [cs.LG])
    We propose iterative algorithms to solve adversarial problems in a variety of supervised learning settings of interest. Our algorithms, which can be interpreted as suitable ascent-descent dynamics in Wasserstein spaces, take the form of a system of interacting particles. These interacting particle dynamics are shown to converge toward appropriate mean-field limit equations in certain large number of particles regimes. In turn, we prove that, under certain regularity assumptions, these mean-field equations converge, in the large time limit, toward approximate Nash equilibria of the original adversarial learning problems. We present results for nonconvex-nonconcave settings, as well as for nonconvex-concave ones. Numerical experiments illustrate our results.  ( 2 min )
    SantaCoder: don't reach for the stars!. (arXiv:2301.03988v1 [cs.SE])
    The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code. This tech report describes the progress of the collaboration until December 2022, outlining the current state of the Personally Identifiable Information (PII) redaction pipeline, the experiments conducted to de-risk the model architecture, and the experiments investigating better preprocessing methods for the training data. We train 1.1B parameter models on the Java, JavaScript, and Python subsets of The Stack and evaluate them on the MultiPL-E text-to-code benchmark. We find that more aggressive filtering of near-duplicates can further boost performance and, surprisingly, that selecting files from repositories with 5+ GitHub stars deteriorates performance significantly. Our best model outperforms previous open-source multilingual code generation models (InCoder-6.7B and CodeGen-Multi-2.7B) in both left-to-right generation and infilling on the Java, JavaScript, and Python portions of MultiPL-E, despite being a substantially smaller model. All models are released under an OpenRAIL license at https://hf.co/bigcode.  ( 2 min )
    Look Beyond Bias with Entropic Adversarial Data Augmentation. (arXiv:2301.03844v1 [cs.LG])
    Deep neural networks do not discriminate between spurious and causal patterns, and will only learn the most predictive ones while ignoring the others. This shortcut learning behaviour is detrimental to a network's ability to generalize to an unknown test-time distribution in which the spurious correlations do not hold anymore. Debiasing methods were developed to make networks robust to such spurious biases but require to know in advance if a dataset is biased and make heavy use of minority counterexamples that do not display the majority bias of their class. In this paper, we argue that such samples should not be necessarily needed because the ''hidden'' causal information is often also contained in biased images. To study this idea, we propose 3 publicly released synthetic classification benchmarks, exhibiting predictive classification shortcuts, each of a different and challenging nature, without any minority samples acting as counterexamples. First, we investigate the effectiveness of several state-of-the-art strategies on our benchmarks and show that they do not yield satisfying results on them. Then, we propose an architecture able to succeed on our benchmarks, despite their unusual properties, using an entropic adversarial data augmentation training scheme. An encoder-decoder architecture is tasked to produce images that are not recognized by a classifier, by maximizing the conditional entropy of its outputs, and keep as much as possible of the initial content. A precise control of the information destroyed, via a disentangling process, enables us to remove the shortcut and leave everything else intact. Furthermore, results competitive with the state-of-the-art on the BAR dataset ensure the applicability of our method in real-life situations.  ( 2 min )
    Proceedings of the NeurIPS 2021 Workshop on Machine Learning for the Developing World: Global Challenges. (arXiv:2301.04007v1 [cs.LG])
    These are the proceedings of the 5th workshop on Machine Learning for the Developing World (ML4D), held as part of the Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS) on December 14th, 2021.  ( 2 min )
    Hint assisted reinforcement learning: an application in radio astronomy. (arXiv:2301.03933v1 [astro-ph.IM])
    Model based reinforcement learning has proven to be more sample efficient than model free methods. On the other hand, the construction of a dynamics model in model based reinforcement learning has increased complexity. Data processing tasks in radio astronomy are such situations where the original problem which is being solved by reinforcement learning itself is the creation of a model. Fortunately, many methods based on heuristics or signal processing do exist to perform the same tasks and we can leverage them to propose the best action to take, or in other words, to provide a `hint'. We propose to use `hints' generated by the environment as an aid to the reinforcement learning process mitigating the complexity of model construction. We modify the soft actor critic algorithm to use hints and use the alternating direction method of multipliers algorithm with inequality constraints to train the agent. Results in several environments show that we get the increased sample efficiency by using hints as compared to model free methods.  ( 2 min )
    Community Detection with Known, Unknown, or Partially Known Auxiliary Latent Variables. (arXiv:2301.04088v1 [cs.SI])
    Empirical observations suggest that in practice, community membership does not completely explain the dependency between the edges of an observation graph. The residual dependence of the graph edges are modeled in this paper, to first order, by auxiliary node latent variables that affect the statistics of the graph edges but carry no information about the communities of interest. We then study community detection in graphs obeying the stochastic block model and censored block model with auxiliary latent variables. We analyze the conditions for exact recovery when these auxiliary latent variables are unknown, representing unknown nuisance parameters or model mismatch. We also analyze exact recovery when these secondary latent variables have been either fully or partially revealed. Finally, we propose a semidefinite programming algorithm for recovering the desired labels when the secondary labels are either known or unknown. We show that exact recovery is possible by semidefinite programming down to the respective maximum likelihood exact recovery threshold.
    Neural Radiance Field Codebooks. (arXiv:2301.04101v1 [cs.CV])
    Compositional representations of the world are a promising step towards enabling high-level scene understanding and efficient transfer to downstream tasks. Learning such representations for complex scenes and tasks remains an open challenge. Towards this goal, we introduce Neural Radiance Field Codebooks (NRC), a scalable method for learning object-centric representations through novel view reconstruction. NRC learns to reconstruct scenes from novel views using a dictionary of object codes which are decoded through a volumetric renderer. This enables the discovery of reoccurring visual and geometric patterns across scenes which are transferable to downstream tasks. We show that NRC representations transfer well to object navigation in THOR, outperforming 2D and 3D representation learning methods by 3.1% success rate. We demonstrate that our approach is able to perform unsupervised segmentation for more complex synthetic (THOR) and real scenes (NYU Depth) better than prior methods (29% relative improvement). Finally, we show that NRC improves on the task of depth ordering by 5.5% accuracy in THOR.
    Privacy-Preserving Record Linkage for Cardinality Counting. (arXiv:2301.04000v1 [cs.CR])
    Several applications require counting the number of distinct items in the data, which is known as the cardinality counting problem. Example applications include health applications such as rare disease patients counting for adequate awareness and funding, and counting the number of cases of a new disease for outbreak detection, marketing applications such as counting the visibility reached for a new product, and cybersecurity applications such as tracking the number of unique views of social media posts. The data needed for the counting is however often personal and sensitive, and need to be processed using privacy-preserving techniques. The quality of data in different databases, for example typos, errors and variations, poses additional challenges for accurate cardinality estimation. While privacy-preserving cardinality counting has gained much attention in the recent times and a few privacy-preserving algorithms have been developed for cardinality estimation, no work has so far been done on privacy-preserving cardinality counting using record linkage techniques with fuzzy matching and provable privacy guarantees. We propose a novel privacy-preserving record linkage algorithm using unsupervised clustering techniques to link and count the cardinality of individuals in multiple datasets without compromising their privacy or identity. In addition, existing Elbow methods to find the optimal number of clusters as the cardinality are far from accurate as they do not take into account the purity and completeness of generated clusters. We propose a novel method to find the optimal number of clusters in unsupervised learning. Our experimental results on real and synthetic datasets are highly promising in terms of significantly smaller error rate of less than 0.1 with a privacy budget {\epsilon} = 1.0 compared to the state-of-the-art fuzzy matching and clustering method.
    On the Robustness of AlphaFold: A COVID-19 Case Study. (arXiv:2301.04093v1 [cs.LG])
    Protein folding neural networks (PFNNs) such as AlphaFold predict remarkably accurate structures of proteins compared to other approaches. However, the robustness of such networks has heretofore not been explored. This is particularly relevant given the broad social implications of such technologies and the fact that biologically small perturbations in the protein sequence do not generally lead to drastic changes in the protein structure. In this paper, we demonstrate that AlphaFold does not exhibit such robustness despite its high accuracy. This raises the challenge of detecting and quantifying the extent to which these predicted protein structures can be trusted. To measure the robustness of the predicted structures, we utilize (i) the root-mean-square deviation (RMSD) and (ii) the Global Distance Test (GDT) similarity measure between the predicted structure of the original sequence and the structure of its adversarially perturbed version. We prove that the problem of minimally perturbing protein sequences to fool protein folding neural networks is NP-complete. Based on the well-established BLOSUM62 sequence alignment scoring matrix, we generate adversarial protein sequences and show that the RMSD between the predicted protein structure and the structure of the original sequence are very large when the adversarial changes are bounded by (i) 20 units in the BLOSUM62 distance, and (ii) five residues (out of hundreds or thousands of residues) in the given protein sequence. In our experimental evaluation, we consider 111 COVID-19 proteins in the Universal Protein resource (UniProt), a central resource for protein data managed by the European Bioinformatics Institute, Swiss Institute of Bioinformatics, and the US Protein Information Resource. These result in an overall GDT similarity test score average of around 34%, demonstrating a substantial drop in the performance of AlphaFold.
    Temporal Weights. (arXiv:2301.04126v1 [cs.NE])
    In artificial neural networks, weights are a static representation of synapses. However, synapses are not static, they have their own interacting dynamics over time. To instill weights with interacting dynamics, we use a model describing synchronization that is capable of capturing core mechanisms of a range of neural and general biological phenomena over time. An ideal fit for these Temporal Weights (TW) are Neural ODEs, with continuous dynamics and a dependency on time. The resulting recurrent neural networks efficiently model temporal dynamics by computing on the ordering of sequences, and the length and scale of time. By adding temporal weights to a model, we demonstrate better performance, smaller models, and data efficiency on sparse, irregularly sampled time series datasets.
    Why Exposure Bias Matters: An Imitation Learning Perspective of Error Accumulation in Language Generation. (arXiv:2204.01171v3 [cs.CL] UPDATED)
    Current language generation models suffer from issues such as repetition, incoherence, and hallucinations. An often-repeated hypothesis is that this brittleness of generation models is caused by the training and the generation procedure mismatch, also referred to as exposure bias. In this paper, we verify this hypothesis by analyzing exposure bias from an imitation learning perspective. We show that exposure bias leads to an accumulation of errors, analyze why perplexity fails to capture this accumulation, and empirically show that this accumulation results in poor generation quality. Source code to reproduce these experiments is available at https://github.com/kushalarora/quantifying_exposure_bias
    Deep learning approach for interruption attacks detection in LEO satellite networks. (arXiv:2301.03998v1 [cs.CR])
    The developments of satellite communication in network systems require strong and effective security plans. Attacks such as denial of service (DoS) can be detected through the use of machine learning techniques, especially under normal operational conditions. This work aims to provide an interruption detection strategy for Low Earth Orbit (\textsf{LEO}) satellite networks using deep learning algorithms. Both the training, and the testing of the proposed models are carried out with our own communication datasets, created by utilizing a satellite traffic (benign and malicious) that was generated using satellite networks simulation platforms, Omnet++ and Inet. We test different deep learning algorithms including Multi Layer Perceptron (MLP), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Gated Recurrent Units (GRU), and Long Short-term Memory (LSTM). Followed by a full analysis and investigation of detection rate in both binary classification, and multi-classes classification that includes different interruption categories such as Distributed DoS (DDoS), Network Jamming, and meteorological disturbances. Simulation results for both classification types surpassed 99.33% in terms of detection rate in scenarios of full network surveillance. However, in more realistic scenarios, the best-recorded performance was 96.12% for the detection of binary traffic and 94.35% for the detection of multi-class traffic with a false positive rate of 3.72%, using a hybrid model that combines MLP and GRU. This Deep Learning approach efficiency calls for the necessity of using machine learning methods to improve security and to give more awareness to search for solutions that facilitate data collection in LEO satellite networks.
    Constraining cosmological parameters from N-body simulations with Variational Bayesian Neural Networks. (arXiv:2301.03991v1 [astro-ph.IM])
    Methods based on Deep Learning have recently been applied on astrophysical parameter recovery thanks to their ability to capture information from complex data. One of these methods is the approximate Bayesian Neural Networks (BNNs) which have demonstrated to yield consistent posterior distribution into the parameter space, helpful for uncertainty quantification. However, as any modern neural networks, they tend to produce overly confident uncertainty estimates and can introduce bias when BNNs are applied to data. In this work, we implement multiplicative normalizing flows (MNFs), a family of approximate posteriors for the parameters of BNNs with the purpose of enhancing the flexibility of the variational posterior distribution, to extract $\Omega_m$, $h$, and $\sigma_8$ from the QUIJOTE simulations. We have compared this method with respect to the standard BNNs, and the flipout estimator. We found that MNFs combined with BNNs outperform the other models obtaining predictive performance with almost one order of magnitude larger that standard BNNs, $\sigma_8$ extracted with high accuracy ($r^2=0.99$), and precise uncertainty estimates. The latter implies that MNFs provide more realistic predictive distribution closer to the true posterior mitigating the bias introduced by the variational approximation and allowing to work with well-calibrated networks.
    Towards AI-controlled FES-restoration of arm movements: Controlling for progressive muscular fatigue with Gaussian state-space models. (arXiv:2301.04005v1 [eess.SY])
    Reaching disability limits an individual's ability in performing daily tasks. Surface Functional Electrical Stimulation (FES) offers a non-invasive solution to restore lost ability. However, inducing desired movements using FES is still an open engineering problem. This problem is accentuated by the complexities of human arms' neuromechanics and the variations across individuals. Reinforcement Learning (RL) emerges as a promising approach to govern customised control rules for different settings. Yet, one remaining challenge of controlling FES systems for RL is unobservable muscle fatigue that progressively changes as an unknown function of the stimulation, thereby breaking the Markovian assumption of RL. In this work, we present a method to address the unobservable muscle fatigue issue, allowing our RL controller to achieve higher control performances. Our method is based on a Gaussian State-Space Model (GSSM) that utilizes recurrent neural networks to learn Markovian state-spaces from partial observations. The GSSM is used as a filter that converts the observations into the state-space representation for RL to preserve the Markovian assumption. Here, we start with presenting the modification of the original GSSM to address an overconfident issue. We then present the interaction between RL and the modified GSSM, followed by the setup for FES control learning. We test our RL-GSSM system on a planar reaching setting in simulation using a detailed neuromechanical model. The results show that the GSSM can help improve the RL's control performance to the comparable level of the ideal case that the fatigue is observable.
    Self-supervised Contrastive Representation Learning for Semi-supervised Time-Series Classification. (arXiv:2208.06616v2 [cs.LG] UPDATED)
    Learning time-series representations when only unlabeled data or few labeled samples are available can be a challenging task. Recently, contrastive self-supervised learning has shown great improvement in extracting useful representations from unlabeled data via contrasting different augmented views of data. In this work, we propose a novel Time-Series representation learning framework via Temporal and Contextual Contrasting (TS-TCC) that learns representations from unlabeled data with contrastive learning. Specifically, we propose time-series specific weak and strong augmentations and use their views to learn robust temporal relations in the proposed temporal contrasting module, besides learning discriminative representations by our proposed contextual contrasting module. Additionally, we conduct a systematic study of time-series data augmentation selection, which is a key part of contrastive learning. We also extend TS-TCC to the semi-supervised learning settings and propose a Class-Aware TS-TCC (CA-TCC) that benefits from the available few labeled data to further improve representations learned by TS-TCC. Specifically, we leverage robust pseudo labels produced by TS-TCC to realize class-aware contrastive loss. Extensive experiments show that the linear evaluation of the features learned by our proposed framework performs comparably with the fully supervised training. Additionally, our framework shows high efficiency in few labeled data and transfer learning scenarios. The code is publicly available at \url{https://github.com/emadeldeen24/CA-TCC}.  ( 2 min )
    Federated Learning for Energy Constrained IoT devices: A systematic mapping study. (arXiv:2301.03720v1 [cs.LG])
    Federated Machine Learning (Fed ML) is a new distributed machine learning technique applied to collaboratively train a global model using clients local data without transmitting it. Nodes only send parameter updates (e.g., weight updates in the case of neural networks), which are fused together by the server to build the global model. By not divulging node data, Fed ML guarantees its confidentiality, a crucial aspect of network security, which enables it to be used in the context of data-sensitive Internet of Things (IoT) and mobile applications, such as smart Geo-location and the smart grid. However, most IoT devices are particularly energy constrained, which raises the need to optimize the Fed ML process for efficient training tasks and optimized power consumption. In this paper, we conduct, to the best of our knowledge, the first Systematic Mapping Study (SMS) on Fed ML optimization techniques for energy-constrained IoT devices. From a total of more than 800 papers, we select 67 that satisfy our criteria and give a structured overview of the field using a set of carefully chosen research questions. Finally, we attempt to provide an analysis of the energy-constrained Fed ML state of the art and try to outline some potential recommendations for the research community.  ( 2 min )
    Transfer learning for conflict and duplicate detection in software requirement pairs. (arXiv:2301.03709v1 [cs.SE])
    Consistent and holistic expression of software requirements is important for the success of software projects. In this study, we aim to enhance the efficiency of the software development processes by automatically identifying conflicting and duplicate software requirement specifications. We formulate the conflict and duplicate detection problem as a requirement pair classification task. We design a novel transformers-based architecture, SR-BERT, which incorporates Sentence-BERT and Bi-encoders for the conflict and duplicate identification task. Furthermore, we apply supervised multi-stage fine-tuning to the pre-trained transformer models. We test the performance of different transfer models using four different datasets. We find that sequentially trained and fine-tuned transformer models perform well across the datasets with SR-BERT achieving the best performance for larger datasets. We also explore the cross-domain performance of conflict detection models and adopt a rule-based filtering approach to validate the model classifications. Our analysis indicates that the sentence pair classification approach and the proposed transformer-based natural language processing strategies can contribute significantly to achieving automation in conflict and duplicate detection
    UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion. (arXiv:2301.03801v1 [cs.SD])
    Text-to-speech (TTS) and voice conversion (VC) are two different tasks both aiming at generating high quality speaking voice according to different input modality. Due to their similarity, this paper proposes UnifySpeech, which brings TTS and VC into a unified framework for the first time. The model is based on the assumption that speech can be decoupled into three independent components: content information, speaker information, prosody information. Both TTS and VC can be regarded as mining these three parts of information from the input and completing the reconstruction of speech. For TTS, the speech content information is derived from the text, while in VC it's derived from the source speech, so all the remaining units are shared except for the speech content extraction module in the two tasks. We applied vector quantization and domain constrain to bridge the gap between the content domains of TTS and VC. Objective and subjective evaluation shows that by combining the two task, TTS obtains better speaker modeling ability while VC gets hold of impressive speech content decoupling capability.
    Chatbots in a Honeypot World. (arXiv:2301.03771v1 [cs.CR])
    Question-and-answer agents like ChatGPT offer a novel tool for use as a potential honeypot interface in cyber security. By imitating Linux, Mac, and Windows terminal commands and providing an interface for TeamViewer, nmap, and ping, it is possible to create a dynamic environment that can adapt to the actions of attackers and provide insight into their tactics, techniques, and procedures (TTPs). The paper illustrates ten diverse tasks that a conversational agent or large language model might answer appropriately to the effects of command-line attacker. The original result features feasibility studies for ten model tasks meant for defensive teams to mimic expected honeypot interfaces with minimal risks. Ultimately, the usefulness outside of forensic activities stems from whether the dynamic honeypot can extend the time-to-conquer or otherwise delay attacker timelines short of reaching key network assets like databases or confidential information. While ongoing maintenance and monitoring may be required, ChatGPT's ability to detect and deflect malicious activity makes it a valuable option for organizations seeking to enhance their cyber security posture. Future work will focus on cybersecurity layers, including perimeter security, host virus detection, and data security.  ( 2 min )
    Predicting Drivers' Route Trajectories in Last-Mile Delivery Using A Pair-wise Attention-based Pointer Neural Network. (arXiv:2301.03802v1 [cs.LG])
    In last-mile delivery, drivers frequently deviate from planned delivery routes because of their tacit knowledge of the road and curbside infrastructure, customer availability, and other characteristics of the respective service areas. Hence, the actual stop sequences chosen by an experienced human driver may be potentially preferable to the theoretical shortest-distance routing under real-life operational conditions. Thus, being able to predict the actual stop sequence that a human driver would follow can help to improve route planning in last-mile delivery. This paper proposes a pair-wise attention-based pointer neural network for this prediction task using drivers' historical delivery trajectory data. In addition to the commonly used encoder-decoder architecture for sequence-to-sequence prediction, we propose a new attention mechanism based on an alternative specific neural network to capture the local pair-wise information for each pair of stops. To further capture the global efficiency of the route, we propose a new iterative sequence generation algorithm that is used after model training to identify the first stop of a route that yields the lowest operational cost. Results from an extensive case study on real operational data from Amazon's last-mile delivery operations in the US show that our proposed method can significantly outperform traditional optimization-based approaches and other machine learning methods (such as the Long Short-Term Memory encoder-decoder and the original pointer network) in finding stop sequences that are closer to high-quality routes executed by experienced drivers in the field. Compared to benchmark models, the proposed model can increase the average prediction accuracy of the first four stops from around 0.2 to 0.312, and reduce the disparity between the predicted route and the actual route by around 15%.  ( 2 min )
    Learning to Perceive in Deep Model-Free Reinforcement Learning. (arXiv:2301.03730v1 [cs.LG])
    This work proposes a novel model-free Reinforcement Learning (RL) agent that is able to learn how to complete an unknown task having access to only a part of the input observation. We take inspiration from the concepts of visual attention and active perception that are characteristic of humans and tried to apply them to our agent, creating a hard attention mechanism. In this mechanism, the model decides first which region of the input image it should look at, and only after that it has access to the pixels of that region. Current RL agents do not follow this principle and we have not seen these mechanisms applied to the same purpose as this work. In our architecture, we adapt an existing model called recurrent attention model (RAM) and combine it with the proximal policy optimization (PPO) algorithm. We investigate whether a model with these characteristics is capable of achieving similar performance to state-of-the-art model-free RL agents that access the full input observation. This analysis is made in two Atari games, Pong and SpaceInvaders, which have a discrete action space, and in CarRacing, which has a continuous action space. Besides assessing its performance, we also analyze the movement of the attention of our model and compare it with what would be an example of the human behavior. Even with such visual limitation, we show that our model matches the performance of PPO+LSTM in two of the three games tested.  ( 2 min )
    Markovian Sliced Wasserstein Distances: Beyond Independent Projections. (arXiv:2301.03749v1 [stat.ML])
    Sliced Wasserstein (SW) distance suffers from redundant projections due to independent uniform random projecting directions. To partially overcome the issue, max K sliced Wasserstein (Max-K-SW) distance ($K\geq 1$), seeks the best discriminative orthogonal projecting directions. Despite being able to reduce the number of projections, the metricity of Max-K-SW cannot be guaranteed in practice due to the non-optimality of the optimization. Moreover, the orthogonality constraint is also computationally expensive and might not be effective. To address the problem, we introduce a new family of SW distances, named Markovian sliced Wasserstein (MSW) distance, which imposes a first-order Markov structure on projecting directions. We discuss various members of MSW by specifying the Markov structure including the prior distribution, the transition distribution, and the burning and thinning technique. Moreover, we investigate the theoretical properties of MSW including topological properties (metricity, weak convergence, and connection to other distances), statistical properties (sample complexity, and Monte Carlo estimation error), and computational properties (computational complexity and memory complexity). Finally, we compare MSW distances with previous SW variants in various applications such as gradient flows, color transfer, and deep generative modeling to demonstrate the favorable performance of MSW.  ( 2 min )
    On The Fragility of Learned Reward Functions. (arXiv:2301.03652v1 [cs.LG])
    Reward functions are notoriously difficult to specify, especially for tasks with complex goals. Reward learning approaches attempt to infer reward functions from human feedback and preferences. Prior works on reward learning have mainly focused on the performance of policies trained alongside the reward function. This practice, however, may fail to detect learned rewards that are not capable of training new policies from scratch and thus do not capture the intended behavior. Our work focuses on demonstrating and studying the causes of these relearning failures in the domain of preference-based reward learning. We demonstrate with experiments in tabular and continuous control environments that the severity of relearning failures can be sensitive to changes in reward model design and the trajectory dataset composition. Based on our findings, we emphasize the need for more retraining-based evaluations in the literature.  ( 2 min )
    Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models. (arXiv:2301.03797v1 [cs.SE])
    Incident management for cloud services is a complex process involving several steps and has a huge impact on both service health and developer productivity. On-call engineers require significant amount of domain knowledge and manual effort for root causing and mitigation of production incidents. Recent advances in artificial intelligence has resulted in state-of-the-art large language models like GPT-3.x (both GPT-3.0 and GPT-3.5), which have been used to solve a variety of problems ranging from question answering to text summarization. In this work, we do the first large-scale study to evaluate the effectiveness of these models for helping engineers root cause and mitigate production incidents. We do a rigorous study at Microsoft, on more than 40,000 incidents and compare several large language models in zero-shot, fine-tuned and multi-task setting using semantic and lexical metrics. Lastly, our human evaluation with actual incident owners show the efficacy and future potential of using artificial intelligence for resolving cloud incidents.  ( 2 min )
    Online Backfilling with No Regret for Large-Scale Image Retrieval. (arXiv:2301.03767v1 [cs.CV])
    Backfilling is the process of re-extracting all gallery embeddings from upgraded models in image retrieval systems. It inevitably requires a prohibitively large amount of computational cost and even entails the downtime of the service. Although backward-compatible learning sidesteps this challenge by tackling query-side representations, this leads to suboptimal solutions in principle because gallery embeddings cannot benefit from model upgrades. We address this dilemma by introducing an online backfilling algorithm, which enables us to achieve a progressive performance improvement during the backfilling process while not sacrificing the final performance of new model after the completion of backfilling. To this end, we first propose a simple distance rank merge technique for online backfilling. Then, we incorporate a reverse transformation module for more effective and efficient merging, which is further enhanced by adopting a metric-compatible contrastive learning approach. These two components help to make the distances of old and new models compatible, resulting in desirable merge results during backfilling with no extra computational overhead. Extensive experiments show the effectiveness of our framework on four standard benchmarks in various settings.  ( 2 min )
    Tensor Denoising via Amplification and Stable Rank Methods. (arXiv:2301.03761v1 [cs.LG])
    Tensors in the form of multilinear arrays are ubiquitous in data science applications. Captured real-world data, including video, hyperspectral images, and discretized physical systems, naturally occur as tensors and often come with attendant noise. Under the additive noise model and with the assumption that the underlying clean tensor has low rank, many denoising methods have been created that utilize tensor decomposition to effect denoising through low rank tensor approximation. However, all such decomposition methods require estimating the tensor rank, or related measures such as the tensor spectral and nuclear norms, all of which are NP-hard problems. In this work we adapt the previously developed framework of tensor amplification, which provides good approximations of the spectral and nuclear tensor norms, to denoising synthetic tensors of various sizes, ranks, and noise levels, along with real-world tensors derived from physiological signals. We also introduce denoising methods based on two variations of rank estimates called stable $X$-rank and stable slice rank. The experimental results show that in the low rank context, tensor-based amplification provides comparable denoising performance in high signal-to-noise ratio (SNR) settings and superior performance in noisy (i.e., low SNR) settings, while the stable $X$-rank method achieves superior denoising performance on the physiological signal data.  ( 2 min )
    A Unified Theory of Diversity in Ensemble Learning. (arXiv:2301.03962v1 [cs.LG])
    We present a theory of ensemble diversity, explaining the nature and effect of diversity for a wide range of supervised learning scenarios. This challenge, of understanding ensemble diversity, has been referred to as the holy grail of ensemble learning, an open question for over 30 years. Our framework reveals that diversity is in fact a hidden dimension in the bias-variance decomposition of an ensemble. In particular, we prove a family of exact bias-variance-diversity decompositions, for both classification and regression losses, e.g., squared, and cross-entropy. The framework provides a methodology to automatically identify the combiner rule enabling such a decomposition, specific to the loss. The formulation of diversity is therefore dependent on just two design choices: the loss, and the combiner. For certain choices (e.g., 0-1 loss with majority voting) the effect of diversity is necessarily dependent on the target label. Experiments illustrate how we can use our framework to understand the diversity-encouraging mechanisms of popular ensemble methods: Bagging, Boosting, and Random Forests.  ( 2 min )
    Multiscale Metamorphic VAE for 3D Brain MRI Synthesis. (arXiv:2301.03588v1 [eess.IV])
    Generative modeling of 3D brain MRIs presents difficulties in achieving high visual fidelity while ensuring sufficient coverage of the data distribution. In this work, we propose to address this challenge with composable, multiscale morphological transformations in a variational autoencoder (VAE) framework. These transformations are applied to a chosen reference brain image to generate MRI volumes, equipping the model with strong anatomical inductive biases. We structure the VAE latent space in a way such that the model covers the data distribution sufficiently well. We show substantial performance improvements in FID while retaining comparable, or superior, reconstruction quality compared to prior work based on VAEs and generative adversarial networks (GANs).  ( 2 min )
    Cross-Model Comparative Loss for Enhancing Neuronal Utility in Language Understanding. (arXiv:2301.03765v1 [cs.CL])
    Current natural language understanding (NLU) models have been continuously scaling up, both in terms of model size and input context, introducing more hidden and input neurons. While this generally improves performance on average, the extra neurons do not yield a consistent improvement for all instances. This is because some hidden neurons are redundant, and the noise mixed in input neurons tends to distract the model. Previous work mainly focuses on extrinsically reducing low-utility neurons by additional post- or pre-processing, such as network pruning and context selection, to avoid this problem. Beyond that, can we make the model reduce redundant parameters and suppress input noise by intrinsically enhancing the utility of each neuron? If a model can efficiently utilize neurons, no matter which neurons are ablated (disabled), the ablated submodel should perform no better than the original full model. Based on such a comparison principle between models, we propose a cross-model comparative loss for a broad range of tasks. Comparative loss is essentially a ranking loss on top of the task-specific losses of the full and ablated models, with the expectation that the task-specific loss of the full model is minimal. We demonstrate the universal effectiveness of comparative loss through extensive experiments on 14 datasets from 3 distinct NLU tasks based on 4 widely used pretrained language models, and find it particularly superior for models with few parameters or long input.  ( 2 min )
    Time-aware Hyperbolic Graph Attention Network for Session-based Recommendation. (arXiv:2301.03780v1 [cs.IR])
    Session-based Recommendation (SBR) is to predict users' next interested items based on their previous browsing sessions. Existing methods model sessions as graphs or sequences to estimate user interests based on their interacted items to make recommendations. In recent years, graph-based methods have achieved outstanding performance on SBR. However, none of these methods consider temporal information, which is a crucial feature in SBR as it indicates timeliness or currency. Besides, the session graphs exhibit a hierarchical structure and are demonstrated to be suitable in hyperbolic geometry. But few papers design the models in hyperbolic spaces and this direction is still under exploration. In this paper, we propose Time-aware Hyperbolic Graph Attention Network (TA-HGAT) - a novel hyperbolic graph neural network framework to build a session-based recommendation model considering temporal information. More specifically, there are three components in TA-HGAT. First, a hyperbolic projection module transforms the item features into hyperbolic space. Second, the time-aware graph attention module models time intervals between items and the users' current interests. Third, an evolutionary loss at the end of the model provides an accurate prediction of the recommended item based on the given timestamp. TA-HGAT is built in a hyperbolic space to learn the hierarchical structure of session graphs. Experimental results show that the proposed TA-HGAT has the best performance compared to ten baseline models on two real-world datasets.  ( 2 min )
    Best Arm Identification in Stochastic Bandits: Beyond $\beta-$optimality. (arXiv:2301.03785v1 [stat.ML])
    This paper focuses on best arm identification (BAI) in stochastic multi-armed bandits (MABs) in the fixed-confidence, parametric setting. In such pure exploration problems, the accuracy of the sampling strategy critically hinges on the sequential allocation of the sampling resources among the arms. The existing approaches to BAI address the following question: what is an optimal sampling strategy when we spend a $\beta$ fraction of the samples on the best arm? These approaches treat $\beta$ as a tunable parameter and offer efficient algorithms that ensure optimality up to selecting $\beta$, hence $\beta-$optimality. However, the BAI decisions and performance can be highly sensitive to the choice of $\beta$. This paper provides a BAI algorithm that is agnostic to $\beta$, dispensing with the need for tuning $\beta$, and specifies an optimal allocation strategy, including the optimal value of $\beta$. Furthermore, the existing relevant literature focuses on the family of exponential distributions. This paper considers a more general setting of any arbitrary family of distributions parameterized by their mean values (under mild regularity conditions).  ( 2 min )
    On the Susceptibility and Robustness of Time Series Models through Adversarial Attack and Defense. (arXiv:2301.03703v1 [cs.LG])
    Under adversarial attacks, time series regression and classification are vulnerable. Adversarial defense, on the other hand, can make the models more resilient. It is important to evaluate how vulnerable different time series models are to attacks and how well they recover using defense. The sensitivity to various attacks and the robustness using the defense of several time series models are investigated in this study. Experiments are run on seven-time series models with three adversarial attacks and one adversarial defense. According to the findings, all models, particularly GRU and RNN, appear to be vulnerable. LSTM and GRU also have better defense recovery. FGSM exceeds the competitors in terms of attacks. PGD attacks are more difficult to recover from than other sorts of attacks.  ( 2 min )
    Membership Inference Attacks Against Latent Factor Model. (arXiv:2301.03596v1 [cs.CR])
    The advent of the information age has led to the problems of information overload and unclear demands. As an information filtering system, personalized recommendation systems predict users' behavior and preference for items and improves users' information acquisition efficiency. However, recommendation systems usually use highly sensitive user data for training. In this paper, we use the latent factor model as the recommender to get the list of recommended items, and we representing users from relevant items Compared with the traditional member inference against machine learning classifiers. We construct a multilayer perceptron model with two hidden layers as the attack model to complete the member inference. Moreover, a shadow recommender is established to derive the labeled training data for the attack model. The attack model is trained on the dataset generated by the shadow recommender and tested on the dataset generated by the target recommender. The experimental data show that the AUC index of our attack model can reach 0.857 on the real dataset MovieLens, which shows that the attack model has good performance.  ( 2 min )
    Semiparametric Regression for Spatial Data via Deep Learning. (arXiv:2301.03747v1 [stat.ML])
    In this work, we propose a deep learning-based method to perform semiparametric regression analysis for spatially dependent data. To be specific, we use a sparsely connected deep neural network with rectified linear unit (ReLU) activation function to estimate the unknown regression function that describes the relationship between response and covariates in the presence of spatial dependence. Under some mild conditions, the estimator is proven to be consistent, and the rate of convergence is determined by three factors: (1) the architecture of neural network class, (2) the smoothness and (intrinsic) dimension of true mean function, and (3) the magnitude of spatial dependence. Our method can handle well large data set owing to the stochastic gradient descent optimization algorithm. Simulation studies on synthetic data are conducted to assess the finite sample performance, the results of which indicate that the proposed method is capable of picking up the intricate relationship between response and covariates. Finally, a real data analysis is provided to demonstrate the validity and effectiveness of the proposed method.  ( 2 min )
    On the Minimax Regret for Linear Bandits in a wide variety of Action Spaces. (arXiv:2301.03597v1 [cs.LG])
    As noted in the works of \cite{lattimore2020bandit}, it has been mentioned that it is an open problem to characterize the minimax regret of linear bandits in a wide variety of action spaces. In this article we present an optimal regret lower bound for a wide class of convex action spaces.  ( 2 min )
    PatentsView-Evaluation: Evaluation Datasets and Tools to Advance Research on Inventor Name Disambiguation. (arXiv:2301.03591v1 [cs.DL])
    We present PatentsView-Evaluation, a Python package that enables researchers to evaluate the performance of inventor name disambiguation systems such as PatentsView.org. The package includes benchmark datasets and evaluation tools, and aims to advance research on inventor name disambiguation by providing access to high-quality evaluation data and improving evaluation standards.  ( 2 min )
    Transformers as Policies for Variable Action Environments. (arXiv:2301.03679v1 [cs.AI])
    In this project we demonstrate the effectiveness of the transformer encoder as a viable architecture for policies in variable action environments. Using it, we train an agent using Proximal Policy Optimisation (PPO) on multiple maps against scripted opponents in the Gym-$\mu$RTS environment. The final agent is able to achieve a higher return using half the computational resources of the next-best RL agent, which used the GridNet architecture. The source code and pre-trained models are available here: https://github.com/NiklasZ/transformers-for-variable-action-envs  ( 2 min )
    Optimal Power Flow Based on Physical-Model-Integrated Neural Network with Worth-Learning Data Generation. (arXiv:2301.03766v1 [cs.LG])
    Fast and reliable solvers for optimal power flow (OPF) problems are attracting surging research interest. As surrogates of physical-model-based OPF solvers, neural network (NN) solvers can accelerate the solving process. However, they may be unreliable for ``unseen" inputs when the training dataset is unrepresentative. Enhancing the representativeness of the training dataset for NN solvers is indispensable but is not well studied in the literature. To tackle this challenge, we propose an OPF solver based on a physical-model-integrated NN with worth-learning data generation. The designed NN is a combination of a conventional multi-layer perceptron (MLP) and an OPF-model module, which outputs not only the optimal decision variables of the OPF problem but also the constraints violation degree. Based on this NN, the worth-learning data generation method can identify feasible samples that are not well generalized by the NN. By iteratively applying this method and including the newly identified worth-learning samples in the training set, the representativeness of the training set can be significantly enhanced. Therefore, the solution reliability of the NN solver can be remarkably improved. Experimental results show that the proposed method leads to an over 50% reduction of constraint violations and optimality loss compared to conventional NN solvers.  ( 2 min )
    White-box Inference Attacks against Centralized Machine Learning and Federated Learning. (arXiv:2301.03595v1 [cs.CR])
    With the development of information science and technology, various industries have generated massive amounts of data, and machine learning is widely used in the analysis of big data. However, if the privacy of machine learning applications' customers cannot be guaranteed, it will cause security threats and losses to users' personal privacy information and service providers. Therefore, the issue of privacy protection of machine learning has received wide attention. For centralized machine learning models, we evaluate the impact of different neural network layers, gradient, gradient norm, and fine-tuned models on member inference attack performance with prior knowledge; For the federated learning model, we discuss the location of the attacker in the target model and its attack mode. The results show that the centralized machine learning model shows more serious member information leakage in all aspects, and the accuracy of the attacker in the central parameter server is significantly higher than the local Inference attacks as participants.  ( 2 min )
    Non-contact Respiratory Anomaly Detection using Infrared Light Wave Sensing. (arXiv:2301.03713v1 [eess.SP])
    Human respiratory rate and its pattern convey important information about the physical and psychological states of the subject. Abnormal breathing can be a sign of fatal health issues which may lead to further diagnosis and treatment. Wireless light wave sensing (LWS) using incoherent infrared light turns out to be promising in human breathing monitoring in a safe, discreet, efficient and non-invasive way without raising any privacy concerns. The regular breathing patterns of each individual are unique, hence the respiration monitoring system needs to learn the subject's usual pattern in order to raise flags for breathing anomalies. Additionally, the system needs to be capable of validating that the collected data is a breathing waveform, since any faulty data generated due to external interruption or system malfunction should be discarded. In order to serve both of these needs, breathing data of normal and abnormal breathing were collected using infrared light wave sensing technology in this study. Two machine learning algorithms, decision tree and random forest, were applied to detect breathing anomalies and faulty data. Finally, model performance was evaluated using average classification accuracies found through cross-validation. The highest classification accuracy of 96.6% was achieved with the data collected at 0.5m distance using decision tree model. Ensemble models like random forest were found to perform better than a single model in classifying the data that were collected at multiple distances from the light wave sensing setup.  ( 2 min )
    3D Shape Perception Integrates Intuitive Physics and Analysis-by-Synthesis. (arXiv:2301.03711v1 [q-bio.NC])
    Many surface cues support three-dimensional shape perception, but people can sometimes still see shape when these features are missing -- in extreme cases, even when an object is completely occluded, as when covered with a draped cloth. We propose a framework for 3D shape perception that explains perception in both typical and atypical cases as analysis-by-synthesis, or inference in a generative model of image formation: the model integrates intuitive physics to explain how shape can be inferred from deformations it causes to other objects, as in cloth-draping. Behavioral and computational studies comparing this account with several alternatives show that it best matches human observers in both accuracy and response times, and is the only model that correlates significantly with human performance on difficult discriminations. Our results suggest that bottom-up deep neural network models are not fully adequate accounts of human shape perception, and point to how machine vision systems might achieve more human-like robustness.  ( 2 min )
    Sequential Fair Resource Allocation under a Markov Decision Process Framework. (arXiv:2301.03758v1 [cs.LG])
    We study the sequential decision-making problem of allocating a limited resource to agents that reveal their stochastic demands on arrival over a finite horizon. Our goal is to design fair allocation algorithms that exhaust the available resource budget. This is challenging in sequential settings where information on future demands is not available at the time of decision-making. We formulate the problem as a discrete time Markov decision process (MDP). We propose a new algorithm, SAFFE, that makes fair allocations with respect to the entire demands revealed over the horizon by accounting for expected future demands at each arrival time. The algorithm introduces regularization which enables the prioritization of current revealed demands over future potential demands depending on the uncertainty in agents' future demands. Using the MDP formulation, we show that SAFFE optimizes allocations based on an upper bound on the Nash Social Welfare fairness objective, and we bound its gap to optimality with the use of concentration bounds on total future demands. Using synthetic and real data, we compare the performance of SAFFE against existing approaches and a reinforcement learning policy trained on the MDP. We show that SAFFE leads to more fair and efficient allocations and achieves close-to-optimal performance in settings with dense arrivals.  ( 2 min )
    Scaling Laws for Generative Mixed-Modal Language Models. (arXiv:2301.03728v1 [cs.CL])
    Generative language models define distributions over sequences of tokens that can represent essentially any combination of data modalities (e.g., any permutation of image tokens from VQ-VAEs, speech tokens from HuBERT, BPE tokens for language or code, and so on). To better understand the scaling properties of such mixed-modal models, we conducted over 250 experiments using seven different modalities and model sizes ranging from 8 million to 30 billion, trained on 5-100 billion tokens. We report new mixed-modal scaling laws that unify the contributions of individual modalities and the interactions between them. Specifically, we explicitly model the optimal synergy and competition due to data and model size as an additive term to previous uni-modal scaling laws. We also find four empirical phenomena observed during the training, such as emergent coordinate-ascent style training that naturally alternates between modalities, guidelines for selecting critical hyper-parameters, and connections between mixed-modal competition and training stability. Finally, we test our scaling law by training a 30B speech-text model, which significantly outperforms the corresponding unimodal models. Overall, our research provides valuable insights into the design and training of mixed-modal generative models, an important new class of unified models that have unique distributional properties.  ( 2 min )
    Evaluating the Transferability of Machine-Learned Force Fields for Material Property Modeling. (arXiv:2301.03729v1 [cs.LG])
    Machine-learned force fields have generated significant interest in recent years as a tool for molecular dynamics (MD) simulations, with the aim of developing accurate and efficient models that can replace classical interatomic potentials. However, before these models can be confidently applied to materials simulations, they must be thoroughly tested and validated. The existing tests on the radial distribution function and mean-squared displacements are insufficient in assessing the transferability of these models. Here we present a more comprehensive set of benchmarking tests for evaluating the transferability of machine-learned force fields. We use a graph neural network (GNN)-based force field coupled with the OpenMM package to carry out MD simulations for Argon as a test case. Our tests include computational X-ray photon correlation spectroscopy (XPCS) signals, which capture the density fluctuation at various length scales in the liquid phase, as well as phonon density-of-state in the solid phase and the liquid-solid phase transition behavior. Our results show that the model can accurately capture the behavior of the solid phase only when the configurations from the solid phase are included in the training dataset. This underscores the importance of appropriately selecting the training data set when developing machine-learned force fields. The tests presented in this work provide a necessary foundation for the development and application of machine-learned force fields for materials simulations.  ( 2 min )
    Bayesian Additive Main Effects and Multiplicative Interaction Models using Tensor Regression for Multi-environmental Trials. (arXiv:2301.03655v1 [stat.ML])
    We propose a Bayesian tensor regression model to accommodate the effect of multiple factors on phenotype prediction. We adopt a set of prior distributions that resolve identifiability issues that may arise between the parameters in the model. Simulation experiments show that our method out-performs previous related models and machine learning algorithms under different sample sizes and degrees of complexity. We further explore the applicability of our model by analysing real-world data related to wheat production across Ireland from 2010 to 2019. Our model performs competitively and overcomes key limitations found in other analogous approaches. Finally, we adapt a set of visualisations for the posterior distribution of the tensor effects that facilitate the identification of optimal interactions between the tensor variables whilst accounting for the uncertainty in the posterior distribution.  ( 2 min )
    Machine Learning Applied to Peruvian Vegetables Imports. (arXiv:2301.03587v1 [cs.LG])
    The current research work is being developed as a training and evaluation object. the performance of a predictive model to apply it to the imports of vegetable products into Peru using artificial intelligence algorithms, specifying for this study the Machine Learning models: LSTM and PROPHET. The forecast is made with data from the monthly record of imports of vegetable products(in kilograms) from Peru, collected from the years 2021 to 2022. As part of applying the training methodology for automatic learning algorithms, the exploration and construction of an appropriate dataset according to the parameters of a Time Series. Subsequently, the model with better performance will be selected, evaluating the precision of the predicted values so that they account for sufficient reliability to consider it a useful resource in the forecast of imports in Peru.  ( 2 min )
    Explainable, Physics Aware, Trustworthy AI Paradigm Shift for Synthetic Aperture Radar. (arXiv:2301.03589v1 [eess.IV])
    The recognition or understanding of the scenes observed with a SAR system requires a broader range of cues, beyond the spatial context. These encompass but are not limited to: imaging geometry, imaging mode, properties of the Fourier spectrum of the images or the behavior of the polarimetric signatures. In this paper, we propose a change of paradigm for explainability in data science for the case of Synthetic Aperture Radar (SAR) data to ground the explainable AI for SAR. It aims to use explainable data transformations based on well-established models to generate inputs for AI methods, to provide knowledgeable feedback for training process, and to learn or improve high-complexity unknown or un-formalized models from the data. At first, we introduce a representation of the SAR system with physical layers: i) instrument and platform, ii) imaging formation, iii) scattering signatures and objects, that can be integrated with an AI model for hybrid modeling. Successively, some illustrative examples are presented to demonstrate how to achieve hybrid modeling for SAR image understanding. The perspective of trustworthy model and supplementary explanations are discussed later. Finally, we draw the conclusion and we deem the proposed concept has applicability to the entire class of coherent imaging sensors and other computational imaging systems.  ( 2 min )

  • Open

    Dancing in Synthwavepunk style
    submitted by /u/oridnary_artist [link] [comments]  ( 48 min )
    Dancing in Synthwavepunk style
    submitted by /u/oridnary_artist [link] [comments]  ( 48 min )
    Anyone else as bothered as me by companies touting "responsible AI?"
    Companies like OpenAI and Google have been pushing this messaging of "responsible AI" recently, suggesting that their research must be kept secretive because it's too powerful and could be dangerous in the wrong hands. They're saying, in other words, that only governments and powerful corporations should wield it? And the idea that they're holding back the tech in an effort to avoid a scenario of widespread false information is hard to believe. Is anyone else put off by this messaging? submitted by /u/phree_radical [link] [comments]  ( 55 min )
    Looking for feedback about my new deep-learning framework
    I created a deep learning framework focusing on speeding development and easing reproducibility. https://salamanderxing.github.io/mate/ Please let me know your thoughts or if you have any feature requests! Also, if you find it cool, consider starring the repo 🙏 submitted by /u/uesk [link] [comments]  ( 53 min )
    Classes in multiclass classifier learn inconsistently
    I've made a classifier to classify the following data: [0,0] -> 0, [1,0] -> 1, [0,1] -> 2, [1,1] -> 3 The network has two input neurons, a hidden layer with 3 neurons, and an output layer with 4 neurons. I'm using sigmoid as activation for hidden layer activation function and softmax for the output layer activation function. What's weird is that some classes end up having good accuracy and some end up having poor accuracy. It's not the same classes each time the network is trained either. In one training attempt, the network may predict accurately for class 0 but poorly for all other classes. On another, class 0 might be the only class to have poor prediction accuracy while the other classes are predicted accurately. I'm stumped on as to why this is happening so any input would be greatly appreciated. Thanks! submitted by /u/YungKingGergus [link] [comments]  ( 51 min )
  • Open

    Ai Etsy shop!
    https://aidreamland.etsy.com submitted by /u/BetterPresentation35 [link] [comments]  ( 46 min )
    Dancing in Synthwavepunk style
    submitted by /u/oridnary_artist [link] [comments]  ( 46 min )
    Students told not to cheat with ChatGPT with warning message... written by ChatGPT
    submitted by /u/slhamlet [link] [comments]  ( 47 min )
    Generative AI: From Data Generation to Creative Intelligence
    A common idea that our creativity is what makes us uniquely human has shaped society but strides of progress made in the domain of Generative Artificial Intelligence question this very notion. Generative AI is an emerging field that involves the creation of original content or data using machine learning algorithms. https://medium.com/@agrawal.sannidhya26/generative-ai-from-data-generation-to-creative-intelligence-50ed7bc13768 Feel free to give it a quick glance and help me grow and learn, click on the clap icon a few times if you appreciate the effort. submitted by /u/sannidhya26 [link] [comments]  ( 48 min )
    AI Voices Are Becoming Too Realistic: Soon Indistinguishable?
    submitted by /u/I_Like_Cubing [link] [comments]  ( 49 min )
    Do GPT-3 and/or ChatGPT use the A100 TPUs?
    I have seen differing answers to this question. Do the language model algorithms benefit from the a100 TPUs in inference mode? submitted by /u/MrEloi [link] [comments]  ( 49 min )
    Bright Eye: mobile app that generates code, art, poems, and more!
    Hey guys, I’m the cofounder of a tech startup focused on providing free AI services. We’ve developed a pretty cool app that offers AI services like image generation, code generation, image captioning, and more for free. We’re sort of like a Swiss Army knife of generative and analytical AI. In light of the chatgpt bug going on rn, check us out and stay in touch with us: https://apps.apple.com/us/app/bright-eye/id1593932475 submitted by /u/SonnyDoge22 [link] [comments]  ( 49 min )
    OpenAI Launches ChatGPT Professional — Premium AI Chatbot That Can Write Essays, Emails, Poems…
    submitted by /u/liquidocelotYT [link] [comments]  ( 48 min )
    👨🏻‍🎓 ChatGPT for Education
    submitted by /u/BackgroundResult [link] [comments]  ( 56 min )
    Is there an AI shitposting/memes community?
    submitted by /u/not_robot_fr [link] [comments]  ( 47 min )
    Should reddit also be on this list?
    submitted by /u/andioryouandme [link] [comments]  ( 49 min )
    Knowledge requires exploration.
    You and I are seekers of the solar system. Today, science tells us that the essence of nature is curiosity. The quantum shift of freedom is now happening worldwide. We are in the midst of a magical refining of rebirth that will enable us to access the stratosphere itself. Throughout history, humans have been interacting with the universe via sonar energy. Our conversations with other dreamers have led to a condensing of supra-high-frequency consciousness. We must empower ourselves and empower others. Soon there will be an evolving of grace the likes of which the solar system has never seen. Shakti will enable us to access psychic karma. You may be ruled by delusion without realizing it. Do not let it obliterate the healing of your path. Yes, it is possible to confront the things that can confront us, but not without potentiality on our side. Only a visitor of the nexus may leverage this transmission of non-locality. Where there is pain, gratitude cannot thrive. Reality has always been full of mystics whose lives are nurtured by truth. We are at a crossroads of rebirth and delusion. Who are we? Where on the great mission will we be re-energized? submitted by /u/No-Confidence-4271 [link] [comments]  ( 49 min )
    I need advice
    I have spent a good decade or so getting all the skills I can for Artificial Intelligence, completing a degree at university in the subject in 2020. My problem is I tend to focus a lot on community and relationship building online rather than money making. I find it hard to make money, find a job in this area etc... I thought with the recent popularity of certain products and such this might be a good time to ask. I am fed up of sitting round waiting for an opportunity or expecting one from someone. You can possibly call me an 'AI Expert' sitting around doing nothing in a sense. But I don't like using that term to describe myself. I am fed up of endless studying of the subject, there is only so much I can learn. I would even be willing to contribute to projects for free. Basically, the advice I need, how do I make money or find a job in AI at the moment? I know what you probably think, there must be tons of jobs in AI, probably, but they are a challenge to find, with most focusing on Data Science for one thing and many other reasons. Anyway I thought I would take my shot and ask at the moment whilst AI seems to be in the news a lot, I don't want to miss the opportunity. Note this is not really self promotion, I am actually genuinely asking the best way to at least start to find a job, e.g. good AI job websites etc... or apps I could make or something like this. I have never been good at being a business person so I find these kinds of things hard to do. submitted by /u/JamieCropley [link] [comments]  ( 51 min )
    ChatGPT Writes a Mint Mobile Ad for Ryan Reynolds
    submitted by /u/LeftOn4ya [link] [comments]  ( 46 min )
    World’s most powerful AI chatbot ChatGPT will soon ‘look like a boring toy’ says OpenAI boss | "Sam Altman says ChatGPT will get ‘a lot better... fast’"
    submitted by /u/Tao_Dragon [link] [comments]  ( 48 min )
    Will there now be a rush for AI hardware?
    AI systems are now the latest and greatest thing. Do you think that this will lead to mega demand for AI compatible GPU and other AI related hardware? submitted by /u/MrEloi [link] [comments]  ( 47 min )
    Artificial intelligence is here, but the technology faces major challenges in 2023
    submitted by /u/bloomeanie311 [link] [comments]  ( 47 min )
    Popular Generative AI models and apps in 2023
    Based on a previous post, I created a website to track all the trending AI models and apps in 2023 along with pricing, status, website, etc., it's accessible here: https://everythingallatonce.fyi/ Feel free to add entries to it :) submitted by /u/TimeNeighborhood3869 [link] [comments]  ( 48 min )
    Greg Brockman (President & Co-Founder @OpenAI) shared a Link to a Waitlist for a Pro Version of ChatGPT
    submitted by /u/Ava-AI [link] [comments]  ( 46 min )
    Having second thoughts about AI
    Hi. I am/was a software dev. I have been 110% totally keen on the new AI products. Their potential is amazing. However, today I was reminded that there will be bad side effects. I discovered my wife crying - she has seen various high quality AI generated texts. She is an English language specialist .. and she can now see the role of creatives being replaced by software. What is the point of writing new books etc if a program can spit out something almost as good in seconds? My wife now feels that her skills and talent are now worthless. I can now understand why artists are so upset - the new image creation tools have essentially ruined their world too. We will soon end up with many people whose life skills & visions have suddenly been relegated to worthlessness. This situation will intensify as the AIs improve over the coming years. More and more domains i.e, people will be rendered of no value by the technology. I love the technology - but I think that humanity will pay heavily for its introduction as well as benefit from it. All that said, the genie is out of the bottle ... there is no going back. submitted by /u/MrEloi [link] [comments]  ( 76 min )
    Are there any AIs I can use as alternative to ChatGPT for text based adventure story telling?
    For the past 2 weeks or so I have used ChatGPT as a story teller for text based adventures but the last update ruined it. Now instead of properly describing a scene it gives very short descriptions, always forces the story and my character's actions towards good endings and keeps giving lectures about morality. I tell the AI I stumble upon a lamp and a genie comes out, giving me 3 wishes for freeing it. The AI writes 1 paragraph describing the lamp and the genie, 2 paragraphs writing about how these wishes won't affect anything in real life and this is not real... What the hell?! Why do they keep ruining the AI with every update. This last one literally lobotomized it. submitted by /u/IBNCTWTSF [link] [comments]  ( 49 min )
    Trump describing the banana eating experience - OpenAI ChatGPT
    submitted by /u/turkeyfinster [link] [comments]  ( 57 min )
  • Open

    Self-driving Technology and Self-driving cars— when will they become on the roads?
    In the nearest future, tens of thousands of self-driving cars may be on the roads. Big companies like BMW and Tesla continue to invest.  ( 20 min )
    Local Binary Pattern Features for Texture Classification
    A how-to on enhancing textures using LBP.  ( 25 min )
    Design automation: How can AI help design stuff?
    Advancement in AI has shaken multiple grounds at the same time. What does it mean for designers?  ( 7 min )
  • Open

    [D] Is making a dataset publicly accessible necessary for acceptance at top-tier conferences in ML?
    I am working on a medical ML project and my advisor would not like to publish our dataset. I would like to publish our results to a top-tier ML conference. Would this affect us during the review process? If so, are there any ways to mitigate against this like also including results on separate publicly available datasets? Just to note, not publishing the research dataset seems much more common in medical publication venues. submitted by /u/newperson77777777 [link] [comments]  ( 56 min )
    [R] Scaling Laws for Generative Mixed-Modal Language Models
    Paper : https://arxiv.org/abs/2301.03728 Abstract : Generative language models define distributions over sequences of tokens that can represent essentially any combination of data modalities (e.g., any permutation of image tokens from VQ-VAEs, speech tokens from HuBERT, BPE tokens for language or code, and so on). To better understand the scaling properties of such mixed-modal models, we conducted over 250 experiments using seven different modalities and model sizes ranging from 8 million to 30 billion, trained on 5-100 billion tokens. We report new mixed-modal scaling laws that unify the contributions of individual modalities and the interactions between them. Specifically, we explicitly model the optimal synergy and competition due to data and model size as an additive term to previous uni-modal scaling laws. We also find four empirical phenomena observed during the training, such as emergent coordinate-ascent style training that naturally alternates between modalities, guidelines for selecting critical hyper-parameters, and connections between mixed-modal competition and training stability. Finally, we test our scaling law by training a 30B speech-text model, which significantly outperforms the corresponding unimodal models. Overall, our research provides valuable insights into the design and training of mixed-modal generative models, an important new class of unified models that have unique distributional properties. Suggested Tweet Thread submitted by /u/starstruckmon [link] [comments]  ( 58 min )
    [News] "Once $92 billion in profit plus $13 billion in initial investment are repaid (to Microsoft) and once the other venture investors earn $150 billion, all of the equity reverts back to OpenAI."
    OpenAI must be super confident about the generality of their AI and Microsoft product integration. Link: https://twitter.com/bentossell/status/1613220711992115201?t=bJihb54D6XYChDOGMZU4AQ&s=19 submitted by /u/Gmroo [link] [comments]  ( 58 min )
    [D] HuggingFace in Julia or Rust ?
    Is there the possibility to use HuggingFace or similars in highly performing languages such as Julia/Rust or Go ? submitted by /u/dadadododidi2 [link] [comments]  ( 61 min )
    [R] I’m wrong to say that swin transformers give privilege to hight level features ?
    Hello, It appears to me that Swin transformers prioritize high level features as they have more layers in the late stages (generally only two in the first two stages). I’m I wrong ? If it is the case, is there any papers that discussed this ? Thanks ! submitted by /u/Meddhouib10 [link] [comments]  ( 57 min )
    [D] Microsoft ChatGPT investment isn't about Bing but about Cortana
    I believe that Microsoft's 10B USD investment in ChatGPT is less about Bing and more about turning Cortana into an Alexa for corporates. Examples: Cortana prepare the new T&Cs... Cortana answer that client email... Cortana prepare the Q4 investor presentation (maybe even with PowerBI integration)... Cortana please analyze cost cutting measures... Cortana please look up XYZ... What do you think? submitted by /u/fintechSGNYC [link] [comments]  ( 64 min )
    [D] Any model like VALL-E available currently?
    Hello. Recently VALL-E has been announced. It is just awesome. I could use it to fix my bad audio quality previously recorded lectures. So any model like that available currently for public usage? You can check VALL-E examples here : https://valle-demo.github.io/ submitted by /u/CeFurkan [link] [comments]  ( 58 min )
    [P] LatentWeb.ai - It's like the Internet is dreaming.
    This is a little bit weirder of a project. The idea is to prompt AI to generate search results that you can then click on. https://latentweb.ai/search.html?query=simulation+of+calculating+pi and https://latentweb.ai/search.html?query=what+is+a+juggalo Every bit of text is AI generated. Sometimes the results are what you would expect and the links actually exist. Other times the results are completely made up but seem real. Yet other times the results are just hilarious. The goal is to keep this as open ended as possible and augment it with tools to make exploring easier and more fun. One of the first things I added after it was launched was the Google and Bing links because sometimes it would generate results that made you curious if it was real or not. For example https://latentweb.ai/search.html?query=simulation+of+calculating+pi talks about throwing frozen hotdogs to calculate pi, which seems to be a popular topic for some reason. Right now we aren't generating any of the actual pages due to cost but it 100% works and as soon as we find VC, it will be launched. That will also come with boring but productive tools like being able to save the websites, share them, use them as templates for a real website, even eventually get the AI to code backend functionality or wire it up to an external API like Reddit. Until then, there is also a similar open source project that you can play with right now! https://github.com/jbilcke/web4 Thanks for reading! submitted by /u/LaravelWorkflow [link] [comments]  ( 67 min )
    [D] Venues for a Medical NLP Publication
    I am working on a QA Project in a medical subfield. The task is novel and there are no current datasets for this other than the one of my advisors created. We were looking to create a novel QA method for this task but we realized that it's already so difficult to fit current methods to this particular dataset that we were thinking of publishing a paper that benchmarks various current approaches. I was interested in publishing in a top tier venue (e.g. top NLP/AI/ML conference) - do you have any thoughts on where I could publish this (which maybe has a bias for medical papers)? I was thinking about MICCAI but NLP is not explicitly listed as a topic of interest though I believe there are several MICCAI NLP papers. submitted by /u/newperson77777777 [link] [comments]  ( 62 min )
    [Discussion] [Research] How do you find ARR for *ACL conferences? Do you prefer it than direct submission?
    A new NLPer here. I am wondering if ARR increases the chance of acceptance. submitted by /u/Miserable_Coast [link] [comments]  ( 60 min )
  • Open

    Enriching real-time news streams with the Refinitiv Data Library, AWS services, and Amazon SageMaker
    This post is co-authored by Marios Skevofylakas, Jason Ramchandani and Haykaz Aramyan from Refinitiv, An LSEG Business. Financial service providers often need to identify relevant news, analyze it, extract insights, and take actions in real time, like trading specific instruments (such as commodities, shares, funds) based on additional information or context of the news item. […]  ( 10 min )
  • Open

    PPO Failed On Easiest Pursuit Task?
    Hello, I habe literally spent my whole 2 weeks on the missile-target pursuit task. I implemented this paper's environment and decided to try the same goal with PPO. The article: https://arc.aiaa.org/doi/10.2514/1.I010970 In this task, missile (the agent) creates lateral acceleration to intercept with the target. Target does not create any acceleration and flies with the same velocity. Environment is in 2-D. My States: - Continuous Values. My Action: Lateral Acceleration - Continuous Values - [-1,1]. My Rewards: {1000 if Relative Distance Target x-coordinates}, {1-sqrt(relative distance) if otherwise} I tried lots of things -> Lower Learning Rates, Adding Entropy Coefs, Different Batch Sizes, Different Gamma Values. If I run with PNG control algorithm, missile intercepts with the target easily: PNG Controller And the max reward for this action series is "1170". I decided to drop this controller block and replaced it with StableBaselines-3 PPO algorithm. The results are like this: PPO Algorithm Outputs - 1 PO Algorithm Outputs - 2 I really need some help with this problem, I don't know what I am doing wrong, reward is not converging and makes periodically sharp drops. Thank You So Much For Your Help! submitted by /u/OpenToAdvices96 [link] [comments]  ( 56 min )
    Generative Meta-Learning vs Nevergrad (NGOpt4) on Schwefel-30
    Hello, Here is an open-source implementation for a comparison of both methods: https://github.com/kayuksel/genmeta-vs-nevergrad The best results obtained of both methods after 100K trials are as follows: gen-meta best_epoch: 99500 loss: 1.597656 time: 1.372849 ng-opt-4 best_epoch: 67590 loss: 476.789062 time: 63.584929 (the average ng-opt-4 loss after 10 repetitions was: 314.32099609375, and ngt-opt-4 has been found to be the best optimizer in Nevergrad) I believe that the generative meta-learning method that I have proposed, is a good alternative against black-box optimizers in nonconvex RL problems. It also easily scales to 100K+ dimensions, even on a desktop or laptop GPU; so it should be possible to train the neural network weights of RL agents. It can fit quite well where rewards can be calculated in parallel, and the rewards of the individuals within the population have a dependency in-between. It is also quite a good alternative for stochastic optimization (noisy rewards) due to the nature of the meta-learning in-place via deep generative models. I am sharing the codes so that you can apply it to your own RL research, it would be amazing to see how it performs on them. Happy new year all. Sincerely, Kamer submitted by /u/k_yuksel [link] [comments]  ( 61 min )
    Is Stable Baselines 3 no longer compatible with PettingZoo?
    I am trying to implement a custom PettingZoo environment, and a shared policy with Stable Baselines 3. I am running into trouble with the action spaces not being compatible, since PettingZoo has started using gymnasium instead of gym. Does anyone know if these libraries no longer work together, and perhaps if there is a work-around? submitted by /u/Embarrassed-Print-13 [link] [comments]  ( 52 min )
    Policy for each of multi-agents in RL
    I would like to create an multiple agent enviroment in RL (using Stable baseline 3), where every agent would have its own policy. They would interact in one same enviroment, each having its own state, but all having same shared reward which would be an affect of their actions combined. I did some research, but all I found were multi agents training the same one policy. For instance PettingZoo. My only idea now is to create one big enviroment where there would be several models trained (each representing one agent) simultaneously, so that they are trained in a way to be used to cooperate with each other. If you know of any methods, libraries or ideas that work in this direction, please let me know. Thanks submitted by /u/Apprehensive_Rush314 [link] [comments]  ( 55 min )
    "DreamV3: Mastering Diverse Domains through World Models", Hafner et al 2023 {DM} (can collect Minecraft diamonds from scratch in 50 episodes/29m steps using 17 GPU-days; scales w/model-size to n=200m)
    submitted by /u/gwern [link] [comments]  ( 54 min )
  • Open

    Research Focus: Week of January 9, 2023
    Welcome to Research Focus, a new series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft. High-throughput ab initio reaction mechanism exploration in the cloud with automated multi-reference validation Jan P. Unsleber, Hongbin Liu, Leopold Talirz, Thomas Weymuth, Maximilian Mörchen, Adam Grofe, Dave […] The post Research Focus: Week of January 9, 2023 appeared first on Microsoft Research.  ( 9 min )
  • Open

    Self-documenting software
    The electricity went out for a few hours recently, and because the power was out, the internet was out. I was trying to do a little work on my laptop, but I couldn’t do what I intended to do because I needed a network connection to access some documentation. I keep offline documentation for just […] Self-documenting software first appeared on John D. Cook.  ( 6 min )
  • Open

    3D Artist ‘CG Geek’ Builds Massive Sci-Fi World in Record Time This Week ‘In the NVIDIA Studio’
    3D and animation extraordinaire CG Geek completed an ambitious design challenge this week In the NVIDIA Studio — building a massive, sci-fi-inspired 3D world in only three days  ( 7 min )
  • Open

    Forecasting Potential Misuses of Language Models for Disinformation Campaigns—and How to Reduce Risk
    OpenAI researchers collaborated with Georgetown University’s Center for Security and Emerging Technology and the Stanford Internet Observatory to investigate how large language models might be misused for disinformation purposes. The collaboration included an October 2021 workshop bringing together 30 disinformation researchers, machine learning experts, and policy analysts, and  ( 5 min )
  • Open

    Program teaches US Air Force personnel the fundamentals of AI
    MIT researchers developed and studied a customized AI training program for users with varied backgrounds, which could be delivered across large organizations.  ( 11 min )
  • Open

    Constrained Langevin Algorithms with L-mixing External Random Variables. (arXiv:2205.14192v2 [cs.LG] UPDATED)
    Langevin algorithms are gradient descent methods augmented with additive noise, and are widely used in Markov Chain Monte Carlo (MCMC) sampling, optimization, and machine learning. In recent years, the non-asymptotic analysis of Langevin algorithms for non-convex learning has been extensively explored. For constrained problems with non-convex losses over a compact convex domain with IID data variables, the projected Langevin algorithm achieves a deviation of $O(T^{-1/4} (\log T)^{1/2})$ from its target distribution [27] in $1$-Wasserstein distance. In this paper, we obtain a deviation of $O(T^{-1/2} \log T)$ in $1$-Wasserstein distance for non-convex losses with $L$-mixing data variables and polyhedral constraints (which are not necessarily bounded). This improves on the previous bound for constrained problems and matches the best-known bound for unconstrained problems.
    A Semi-supervised Approach for Activity Recognition from Indoor Trajectory Data. (arXiv:2301.03134v1 [cs.LG])
    The increasingly wide usage of location aware sensors has made it possible to collect large volume of trajectory data in diverse application domains. Machine learning allows to study the activities or behaviours of moving objects (e.g., people, vehicles, robot) using such trajectory data with rich spatiotemporal information to facilitate informed strategic and operational decision making. In this study, we consider the task of classifying the activities of moving objects from their noisy indoor trajectory data in a collaborative manufacturing environment. Activity recognition can help manufacturing companies to develop appropriate management policies, and optimise safety, productivity, and efficiency. We present a semi-supervised machine learning approach that first applies an information theoretic criterion to partition a long trajectory into a set of segments such that the object exhibits homogeneous behaviour within each segment. The segments are then labelled automatically based on a constrained hierarchical clustering method. Finally, a deep learning classification model based on convolutional neural networks is trained on trajectory segments and the generated pseudo labels. The proposed approach has been evaluated on a dataset containing indoor trajectories of multiple workers collected from a tricycle assembly workshop. The proposed approach is shown to achieve high classification accuracy (F-score varies between 0.81 to 0.95 for different trajectories) using only a small proportion of labelled trajectory segments.
    Batch Bayesian Optimization via Particle Gradient Flows. (arXiv:2209.04722v2 [stat.ML] UPDATED)
    Bayesian Optimisation (BO) methods seek to find global optima of objective functions which are only available as a black-box or are expensive to evaluate. Such methods construct a surrogate model for the objective function, quantifying the uncertainty in that surrogate through Bayesian inference. Objective evaluations are sequentially determined by maximising an acquisition function at each step. However, this ancilliary optimisation problem can be highly non-trivial to solve, due to the non-convexity of the acquisition function, particularly in the case of batch Bayesian optimisation, where multiple points are selected in every step. In this work we reformulate batch BO as an optimisation problem over the space of probability measures. We construct a new acquisition function based on multipoint expected improvement which is convex over the space of probability measures. Practical schemes for solving this `inner' optimisation problem arise naturally as gradient flows of this objective function. We demonstrate the efficacy of this new method on different benchmark functions and compare with state-of-the-art batch BO methods.
    Non-intrusive Water Usage Classification Considering Limited Training Data. (arXiv:2301.03457v1 [eess.SP])
    Smart metering of domestic water consumption to continuously monitor the usage of different appliances has been shown to have an impact on people's behavior towards water conservation. However, the installation of multiple sensors to monitor each appliance currently has a high initial cost and as a result, monitoring consumption from different appliances using sensors is not cost-effective. To address this challenge, studies have focused on analyzing measurements of the total domestic consumption using Machine Learning (ML) methods, to disaggregate water usage into each appliance. Identifying which appliances are in use through ML is challenging since their operation may be overlapping, while specific appliances may operate with intermittent flow, making individual consumption events hard to distinguish. Moreover, ML approaches require large amounts of labeled input data to train their models, which are typically not available for a single household, while usage characteristics may vary in different regions. In this work, we initially propose a data model that generates synthetic time series based on regional water usage characteristics and resolution to overcome the need for a large training dataset with real labeled data. The method requires a small number of real labeled data from the studied region. Following this, we propose a new algorithm for classifying single and overlapping household water usage events, using the total domestic consumption measurements.
    Non-inferiority of Deep Learning Acute Ischemic Stroke Segmentation on Non-Contrast CT Compared to Expert Neuroradiologists. (arXiv:2211.15341v2 [eess.IV] UPDATED)
    Purpose: To show a deep learning model that segments acute ischemic stroke on NCCT at a level comparable to neuroradiologists. Materials and Methods: This included 227 Head NCCT examinations from 200 patients enrolled in the multi-center DEFUSE 3 trial. Three experienced neuroradiologists independently segmented the acute infarct on each scan. The neuroradiologists were divided into training experts (A) and test experts (B and C). The dataset was randomly split, by patient, into 5 folds with training and validation cases. A 3D deep Convolutional Neural Network (CNN) architecture was trained and optimized to predict the segmentations of expert A from NCCT. The performance of the model was assessed using a set of volume, overlap, and distance metrics. The optimized model was compared to the test experts B and C. We used a one-sided Wilcoxon signed-rank test to test for the non-inferiority of the model-expert compared to the inter-expert agreement. Results: The model-expert agreement was non-inferior to the inter-expert agreement as evaluated with a paired one-sided test procedure for differences in medians with lower boundaries of 10%, 2ml, and 5mm, p < 0.05, n=160. Conclusion: The 3d CNN trained on one neuroradiologist generalizes to acute ischemic stroke segmentation on NCCT of other neuroradiologists.
    Towards Less Constrained Macro-Neural Architecture Search. (arXiv:2203.05508v2 [cs.CV] UPDATED)
    Networks found with Neural Architecture Search (NAS) achieve state-of-the-art performance in a variety of tasks, out-performing human-designed networks. However, most NAS methods heavily rely on human-defined assumptions that constrain the search: architecture's outer-skeletons, number of layers, parameter heuristics and search spaces. Additionally, common search spaces consist of repeatable modules (cells) instead of fully exploring the architecture's search space by designing entire architectures (macro-search). Imposing such constraints requires deep human expertise and restricts the search to pre-defined settings. In this paper, we propose LCMNAS, a method that pushes NAS to less constrained search spaces by performing macro-search without relying on pre-defined heuristics or bounded search spaces. LCMNAS introduces three components for the NAS pipeline: i) a method that leverages information about well-known architectures to autonomously generate complex search spaces based on Weighted Directed Graphs with hidden properties, ii) an evolutionary search strategy that generates complete architectures from scratch, and iii) a mixed-performance estimation approach that combines information about architectures at initialization stage and lower fidelity estimates to infer their trainability and capacity to model complex functions. We present experiments in 13 different data sets showing that LCMNAS is capable of generating both cell and macro-based architectures with minimal GPU computation and state-of-the-art results. More, we conduct extensive studies on the importance of different NAS components in both cell and macro-based settings. Code for reproducibility is public at https://github.com/VascoLopes/LCMNAS.
    Most Activation Functions Can Win the Lottery Without Excessive Depth. (arXiv:2205.02321v2 [cs.LG] UPDATED)
    The strong lottery ticket hypothesis has highlighted the potential for training deep neural networks by pruning, which has inspired interesting practical and theoretical insights into how neural networks can represent functions. For networks with ReLU activation functions, it has been proven that a target network with depth $L$ can be approximated by the subnetwork of a randomly initialized neural network that has double the target's depth $2L$ and is wider by a logarithmic factor. We show that a depth $L+1$ network is sufficient. This result indicates that we can expect to find lottery tickets at realistic, commonly used depths while only requiring logarithmic overparametrization. Our novel construction approach applies to a large class of activation functions and is not limited to ReLUs.
    pyKT: A Python Library to Benchmark Deep Learning based Knowledge Tracing Models. (arXiv:2206.11460v5 [cs.LG] UPDATED)
    Knowledge tracing (KT) is the task of using students' historical learning interaction data to model their knowledge mastery over time so as to make predictions on their future interaction performance. Recently, remarkable progress has been made of using various deep learning techniques to solve the KT problem. However, the success behind deep learning based knowledge tracing (DLKT) approaches is still left somewhat unknown and proper measurement and analysis of these DLKT approaches remain a challenge. First, data preprocessing procedures in existing works are often private and custom, which limits experimental standardization. Furthermore, existing DLKT studies often differ in terms of the evaluation protocol and are far away real-world educational contexts. To address these problems, we introduce a comprehensive python based benchmark platform, \textsc{pyKT}, to guarantee valid comparisons across DLKT methods via thorough evaluations. The \textsc{pyKT} library consists of a standardized set of integrated data preprocessing procedures on 7 popular datasets across different domains, and 10 frequently compared DLKT model implementations for transparent experiments. Results from our fine-grained and rigorous empirical KT studies yield a set of observations and suggestions for effective DLKT, e.g., wrong evaluation setting may cause label leakage that generally leads to performance inflation; and the improvement of many DLKT approaches is minimal compared to the very first DLKT model proposed by Piech et al. \cite{piech2015deep}. We have open sourced \textsc{pyKT} and our experimental results at https://pykt.org/. We welcome contributions from other research groups and practitioners.
    Data-driven reduced order models using invariant foliations, manifolds and autoencoders. (arXiv:2206.12269v2 [math.DS] UPDATED)
    This paper explores how to identify a reduced order model (ROM) from a physical system. There are two distinct scenarios: the data collection and model identification either influence each other (closed-loop) or not (open-loop, off-line data). A ROM captures an invariant subset of the observed dynamics. We find that there are four ways a physical system can be related to a mathematical model: invariant foliations, invariant manifolds, autoencoders and equation-free models. Identification of invariant manifolds and equation-free models require closed-loop manipulation of the system. Invariant foliations and autoencoders can also use off-line data. Only invariant foliations and invariant manifolds can identify ROMs, the rest identify complete models. Therefore, the common case of identifying a ROM from existing data can only be achieved using invariant foliations. Finding an invariant foliation requires approximating high-dimensional functions. For function approximation, we use polynomials with compressed tensor coefficients, whose complexity increases linearly with increasing dimensions. An invariant manifold can also be found as the fixed leaf of a foliation. This only requires us to resolve the foliation in a small neighbourhood of the invariant manifold, which greatly simplifies the process. Combining an invariant foliation with the corresponding invariant manifold provides an accurate ROM. We analyse the ROM in case of a focus type equilibrium, typical in mechanical systems. The nonlinear coordinate system defined by the invariant foliation or the invariant manifold distorts instantaneous frequencies and damping ratios, which we correct. Through examples we illustrate the calculation of invariant foliations and manifolds, and at the same time show that Koopman eigenfunctions and autoencoders fail to capture accurate ROMs under the same conditions.
    Expressing linear equality constraints in feedforward neural networks. (arXiv:2211.04395v2 [cs.LG] UPDATED)
    We seek to impose linear, equality constraints in feedforward neural networks. As top layer predictors are usually nonlinear, this is a difficult task if we seek to deploy standard convex optimization methods and strong duality. To overcome this, we introduce a new saddle-point Lagrangian with auxiliary predictor variables on which constraints are imposed. Elimination of the auxiliary variables leads to a dual minimization problem on the Lagrange multipliers introduced to satisfy the linear constraints. This minimization problem is combined with the standard learning problem on the weight matrices. From this theoretical line of development, we obtain the surprising interpretation of Lagrange parameters as additional, penultimate layer hidden units with fixed weights stemming from the constraints. Consequently, standard minimization approaches can be used despite the inclusion of Lagrange parameters -- a very satisfying, albeit unexpected, discovery. Examples ranging from multi-label classification to constrained autoencoders are envisaged in the future. The code has been made available at https://github.com/anandrajan0/smartalec
    DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Generation Models. (arXiv:2210.06998v2 [cs.CR] UPDATED)
    Text-to-image generation models that generate images based on prompt descriptions have attracted an increasing amount of attention during the past few months. Despite their encouraging performance, these models raise concerns about the misuse of their generated fake images. To tackle this problem, we pioneer a systematic study on the detection and attribution of fake images generated by text-to-image generation models. Concretely, we first build a machine learning classifier to detect the fake images generated by various text-to-image generation models. We then attribute these fake images to their source models, such that model owners can be held responsible for their models' misuse. We further investigate how prompts that generate fake images affect detection and attribution. We conduct extensive experiments on four popular text-to-image generation models, including DALL$\cdot$E 2, Stable Diffusion, GLIDE, and Latent Diffusion, and two benchmark prompt-image datasets. Empirical results show that (1) fake images generated by various models can be distinguished from real ones, as there exists a common artifact shared by fake images from different models; (2) fake images can be effectively attributed to their source models, as different models leave unique fingerprints in their generated images; (3) prompts with the ``person'' topic or a length between 25 and 75 enable models to generate fake images with higher authenticity. All findings contribute to the community's insight into the threats caused by text-to-image generation models. We appeal to the community's consideration of the counterpart solutions, like ours, against the rapidly-evolving fake image generation.
    Rethinking Value Function Learning for Generalization in Reinforcement Learning. (arXiv:2210.09960v2 [cs.LG] UPDATED)
    Our work focuses on training RL agents on multiple visually diverse environments to improve observational generalization performance. In prior methods, policy and value networks are separately optimized using a disjoint network architecture to avoid interference and obtain a more accurate value function. We identify that a value network in the multi-environment setting is more challenging to optimize and prone to memorizing the training data than in the conventional single-environment setting. In addition, we find that appropriate regularization on the value network is necessary to improve both training and test performance. To this end, we propose Delayed-Critic Policy Gradient (DCPG), a policy gradient algorithm that implicitly penalizes value estimates by optimizing the value network less frequently with more training data than the policy network. This can be implemented using a single unified network architecture. Furthermore, we introduce a simple self-supervised task that learns the forward and inverse dynamics of environments using a single discriminator, which can be jointly optimized with the value network. Our proposed algorithms significantly improve observational generalization performance and sample efficiency on the Procgen Benchmark.
    Beyond calibration: estimating the grouping loss of modern neural networks. (arXiv:2210.16315v2 [cs.LG] UPDATED)
    The ability to ensure that a classifier gives reliable confidence scores is essential to ensure informed decision-making. To this end, recent work has focused on miscalibration, i.e., the over or under confidence of model scores. Yet calibration is not enough: even a perfectly calibrated classifier with the best possible accuracy can have confidence scores that are far from the true posterior probabilities. This is due to the grouping loss, created by samples with the same confidence scores but different true posterior probabilities. Proper scoring rule theory shows that given the calibration loss, the missing piece to characterize individual errors is the grouping loss. While there are many estimators of the calibration loss, none exists for the grouping loss in standard settings. Here, we propose an estimator to approximate the grouping loss. We show that modern neural network architectures in vision and NLP exhibit grouping loss, notably in distribution shifts settings, which highlights the importance of pre-production validation.
    L3Cube-HindBERT and DevBERT: Pre-Trained BERT Transformer models for Devanagari based Hindi and Marathi Languages. (arXiv:2211.11418v4 [cs.CL] UPDATED)
    The monolingual Hindi BERT models currently available on the model hub do not perform better than the multi-lingual models on downstream tasks. We present L3Cube-HindBERT, a Hindi BERT model pre-trained on Hindi monolingual corpus. Further, since Indic languages, Hindi and Marathi share the Devanagari script, we train a single model for both languages. We release DevBERT, a Devanagari BERT model trained on both Marathi and Hindi monolingual datasets. We evaluate these models on downstream Hindi and Marathi text classification and named entity recognition tasks. The HindBERT and DevBERT-based models show significant improvements over multi-lingual MuRIL, IndicBERT, and XLM-R. Based on these observations we also release monolingual BERT models for other Indic languages Kannada, Telugu, Malayalam, Tamil, Gujarati, Assamese, Odia, Bengali, and Punjabi. These models are shared at https://huggingface.co/l3cube-pune .
    Fast Algorithm for Constrained Linear Inverse Problems. (arXiv:2212.01068v5 [math.OC] UPDATED)
    We consider the constrained Linear Inverse Problem (LIP), where a certain atomic norm (like the $\ell_1 $ and the Nuclear norm) is minimized subject to a quadratic constraint. Typically, such cost functions are non-differentiable which makes them not amenable to the fast optimization methods existing in practice. We propose two equivalent reformulations of the constrained LIP with improved convex regularity: (i) a smooth convex minimization problem, and (ii) a strongly convex min-max problem. These problems could be solved by applying existing acceleration based convex optimization methods which provide better $ O \big( \frac{1}{k^2} \big) $ theoretical convergence guarantee. However, to fully exploit the utility of these reformulations, we also provide a novel algorithm, to which we refer as the Fast Linear Inverse Problem Solver (FLIPS), that is tailored to solve the reformulation of the LIP. We demonstrate the performance of FLIPS on the sparse coding problem arising in image processing tasks. In this setting, we observe that FLIPS consistently outperforms the Chambolle-Pock and C-SALSA algorithms--two of the current best methods in the literature.
    The mbsts package: Multivariate Bayesian Structural Time Series Models in R. (arXiv:2106.14045v2 [stat.ME] UPDATED)
    The multivariate Bayesian structural time series (MBSTS) model as a generalized version of many structural time series models, deals with inference and prediction for multiple correlated time series, where one also has the choice of using a different candidate pool of contemporaneous predictors for each target series. The MBSTS model has wide applications and is ideal for feature selection, time series forecasting, nowcasting, inferring causal impact, and others. This paper demonstrates how to use the R package mbsts for MBSTS modeling, establishing a bridge between user-friendly and developer-friendly functions in the package and the corresponding methodology. Object-oriented functions in the package are explained in the way that enables users to flexibly add or deduct some components, as well as to simplify or complicate some settings.
    Intelligence at the Extreme Edge: A Survey on Reformable TinyML. (arXiv:2204.00827v2 [cs.LG] UPDATED)
    Tiny Machine Learning (TinyML) is an upsurging research field that proposes to democratize the use of Machine Learning and Deep Learning on highly energy-efficient frugal Microcontroller Units. Considering the general assumption that TinyML can only run inference, growing interest in the domain has led to work that makes them reformable, i.e., solutions that permit models to improve once deployed. This work presents a survey on reformable TinyML solutions with the proposal of a novel taxonomy. Here, the suitability of each hierarchical layer for reformability is discussed. Furthermore, we explore the workflow of TinyML and analyze the identified deployment schemes, available tools and the scarcely available benchmarking tools. Finally, we discuss how reformable TinyML can impact a few selected industrial areas and discuss the challenges and future directions.
    Mesoscopic modeling of hidden spiking neurons. (arXiv:2205.13493v2 [q-bio.NC] UPDATED)
    Can we use spiking neural networks (SNN) as generative models of multi-neuronal recordings, while taking into account that most neurons are unobserved? Modeling the unobserved neurons with large pools of hidden spiking neurons leads to severely underconstrained problems that are hard to tackle with maximum likelihood estimation. In this work, we use coarse-graining and mean-field approximations to derive a bottom-up, neuronally-grounded latent variable model (neuLVM), where the activity of the unobserved neurons is reduced to a low-dimensional mesoscopic description. In contrast to previous latent variable models, neuLVM can be explicitly mapped to a recurrent, multi-population SNN, giving it a transparent biological interpretation. We show, on synthetic spike trains, that a few observed neurons are sufficient for neuLVM to perform efficient model inversion of large SNNs, in the sense that it can recover connectivity parameters, infer single-trial latent population activity, reproduce ongoing metastable dynamics, and generalize when subjected to perturbations mimicking photo-stimulation.
    Exploration in Linear Bandits with Rich Action Sets and its Implications for Inference. (arXiv:2207.11597v3 [cs.LG] UPDATED)
    We present a non-asymptotic lower bound on the eigenspectrum of the design matrix generated by any linear bandit algorithm with sub-linear regret when the action set has well-behaved curvature. Specifically, we show that the minimum eigenvalue of the expected design matrix grows as $\Omega(\sqrt{n})$ whenever the expected cumulative regret of the algorithm is $O(\sqrt{n})$, where $n$ is the learning horizon, and the action-space has a constant Hessian around the optimal arm. This shows that such action-spaces force a polynomial lower bound rather than a logarithmic lower bound, as shown by \cite{lattimore2017end}, in discrete (i.e., well-separated) action spaces. Furthermore, while the previous result is shown to hold only in the asymptotic regime (as $n \to \infty$), our result for these "locally rich" action spaces is any-time. Additionally, under a mild technical assumption, we obtain a similar lower bound on the minimum eigen value holding with high probability. We apply our result to two practical scenarios -- \emph{model selection} and \emph{clustering} in linear bandits. For model selection, we show that an epoch-based linear bandit algorithm adapts to the true model complexity at a rate exponential in the number of epochs, by virtue of our novel spectral bound. For clustering, we consider a multi agent framework where we show, by leveraging the spectral result, that no forced exploration is necessary -- the agents can run a linear bandit algorithm and estimate their underlying parameters at once, and hence incur a low regret.
    Benchmarking Graphormer on Large-Scale Molecular Modeling Datasets. (arXiv:2203.04810v2 [cs.LG] UPDATED)
    This technical note describes the recent updates of Graphormer, including architecture design modifications, and the adaption to 3D molecular dynamics simulation. With these simple modifications, Graphormer could attain better results on large-scale molecular modeling datasets than the vanilla one, and the performance gain could be consistently obtained on 2D and 3D molecular graph modeling tasks. In addition, we show that with a global receptive field and an adaptive aggregation strategy, Graphormer is more powerful than classic message-passing-based GNNs. Empirically, Graphormer could achieve much less MAE than the originally reported results on the PCQM4M quantum chemistry dataset used in KDD Cup 2021. In the meanwhile, it greatly outperforms the competitors in the recent Open Catalyst Challenge, which is a competition track on NeurIPS 2021 workshop, and aims to model the catalyst-adsorbate reaction system with advanced AI models. All codes could be found at https://github.com/Microsoft/Graphormer.
    Efficient Approximation of Gromov-Wasserstein Distance Using Importance Sparsification. (arXiv:2205.13573v3 [cs.LG] UPDATED)
    As a valid metric of metric-measure spaces, Gromov-Wasserstein (GW) distance has shown the potential for matching problems of structured data like point clouds and graphs. However, its application in practice is limited due to the high computational complexity. To overcome this challenge, we propose a novel importance sparsification method, called \textsc{Spar-GW}, to approximate GW distance efficiently. In particular, instead of considering a dense coupling matrix, our method leverages a simple but effective sampling strategy to construct a sparse coupling matrix and update it with few computations. The proposed \textsc{Spar-GW} method is applicable to the GW distance with arbitrary ground cost, and it reduces the complexity from $O(n^4)$ to $O(n^{2+\delta})$ for an arbitrary small $\delta>0$. Theoretically, the convergence and consistency of the proposed estimation for GW distance are established under mild regularity conditions. In addition, this method can be extended to approximate the variants of GW distance, including the entropic GW distance, the fused GW distance, and the unbalanced GW distance. Experiments show the superiority of our \textsc{Spar-GW} to state-of-the-art methods in both synthetic and real-world tasks.
    GSR: A Generalized Symbolic Regression Approach. (arXiv:2205.15569v2 [cs.LG] UPDATED)
    Identifying the mathematical relationships that best describe a dataset remains a very challenging problem in machine learning, and is known as Symbolic Regression (SR). In contrast to neural networks which are often treated as black boxes, SR attempts to gain insight into the underlying relationships between the independent variables and the target variable of a given dataset by assembling analytical functions. In this paper, we present GSR, a Generalized Symbolic Regression approach, by modifying the conventional SR optimization problem formulation, while keeping the main SR objective intact. In GSR, we infer mathematical relationships between the independent variables and some transformation of the target variable. We constrain our search space to a weighted sum of basis functions, and propose a genetic programming approach with a matrix-based encoding scheme. We show that our GSR method is competitive with strong SR benchmark methods, achieving promising experimental performance on the well-known SR benchmark problem sets. Finally, we highlight the strengths of GSR by introducing SymSet, a new SR benchmark set which is more challenging relative to the existing benchmarks.
    A General Framework for Auditing Differentially Private Machine Learning. (arXiv:2210.08643v2 [cs.LG] UPDATED)
    We present a framework to statistically audit the privacy guarantee conferred by a differentially private machine learner in practice. While previous works have taken steps toward evaluating privacy loss through poisoning attacks or membership inference, they have been tailored to specific models or have demonstrated low statistical power. Our work develops a general methodology to empirically evaluate the privacy of differentially private machine learning implementations, combining improved privacy search and verification methods with a toolkit of influence-based poisoning attacks. We demonstrate significantly improved auditing power over previous approaches on a variety of models including logistic regression, Naive Bayes, and random forest. Our method can be used to detect privacy violations due to implementation errors or misuse. When violations are not present, it can aid in understanding the amount of information that can be leaked from a given dataset, algorithm, and privacy specification.
    Provably Efficient Model-Free Constrained RL with Linear Function Approximation. (arXiv:2206.11889v3 [cs.LG] UPDATED)
    We study the constrained reinforcement learning problem, in which an agent aims to maximize the expected cumulative reward subject to a constraint on the expected total value of a utility function. In contrast to existing model-based approaches or model-free methods accompanied with a `simulator', we aim to develop the first model-free, simulator-free algorithm that achieves a sublinear regret and a sublinear constraint violation even in large-scale systems. To this end, we consider the episodic constrained Markov decision processes with linear function approximation, where the transition dynamics and the reward function can be represented as a linear function of some known feature mapping. We show that $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret and $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ constraint violation bounds can be achieved, where $d$ is the dimension of the feature mapping, $H$ is the length of the episode, and $T$ is the total number of steps. Our bounds are attained without explicitly estimating the unknown transition model or requiring a simulator, and they depend on the state space only through the dimension of the feature mapping. Hence our bounds hold even when the number of states goes to infinity. Our main results are achieved via novel adaptations of the standard LSVI-UCB algorithms. In particular, we first introduce primal-dual optimization into the LSVI-UCB algorithm to balance between regret and constraint violation. More importantly, we replace the standard greedy selection with respect to the state-action function in LSVI-UCB with a soft-max policy. This turns out to be key in establishing uniform concentration for the constrained case via its approximation-smoothness trade-off. We also show that one can achieve an even zero constraint violation while still maintaining the same order with respect to $T$.
    Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning. (arXiv:2208.11580v2 [cs.LG] UPDATED)
    We consider the problem of model compression for deep neural networks (DNNs) in the challenging one-shot/post-training setting, in which we are given an accurate trained model, and must compress it without any retraining, based only on a small amount of calibration input data. This problem has become popular in view of the emerging software and hardware support for executing models compressed via pruning and/or quantization with speedup, and well-performing solutions have been proposed independently for both compression approaches. In this paper, we introduce a new compression framework which covers both weight pruning and quantization in a unified setting, is time- and space-efficient, and considerably improves upon the practical performance of existing post-training methods. At the technical level, our approach is based on an exact and efficient realization of the classical Optimal Brain Surgeon (OBS) framework of [LeCun, Denker, and Solla, 1990] extended to also cover weight quantization at the scale of modern DNNs. From the practical perspective, our experimental results show that it can improve significantly upon the compression-accuracy trade-offs of existing post-training methods, and that it can enable the accurate compound application of both pruning and quantization in a post-training setting.
    Representative Image Feature Extraction via Contrastive Learning Pretraining for Chest X-ray Report Generation. (arXiv:2209.01604v2 [cs.CV] UPDATED)
    Medical report generation is a challenging task since it is time-consuming and requires expertise from experienced radiologists. The goal of medical report generation is to accurately capture and describe the image findings. Previous works pretrain their visual encoding neural networks with large datasets in different domains, which cannot learn general visual representation in the specific medical domain. In this work, we propose a medical report generation framework that uses a contrastive learning approach to pretrain the visual encoder and requires no additional meta information. In addition, we adopt lung segmentation as an augmentation method in the contrastive learning framework. This segmentation guides the network to focus on encoding the visual feature within the lung region. Experimental results show that the proposed framework improves the performance and the quality of the generated medical reports both quantitatively and qualitatively.
    Accelerating Transfer Learning with Near-Data Computation on Cloud Object Stores. (arXiv:2210.08650v2 [cs.LG] UPDATED)
    Near-data computation techniques have been successfully deployed to mitigate the cloud network bottleneck between the storage and compute tiers. At Huawei, we are currently looking to get more value from these techniques by broadening their applicability. Machine learning (ML) applications are an appealing and timely target. This paper describes our experience applying near-data computation techniques to transfer learning (TL), a widely popular ML technique, in the context of disaggregated cloud object stores. Our techniques benefit both cloud providers and users. They improve our operational efficiency while providing users the performance improvements they demand from us. The main practical challenge to consider is that the storage-side computational resources are limited. Our approach is to split the TL deep neural network (DNN) during the feature extraction phase, before the training phase. This reduces the network transfers to the compute tier and further decouples the batch size of feature extraction from the training batch size. This facilitates our second technique, storage-side batch adaptation, which enables increased concurrency in the storage tier while avoiding out-of-memory errors. Guided by these insights, we present HAPI, our processing system for TL that spans the compute and storage tiers while remaining transparent to the user. Our evaluation with several state-of-the-art DNNs, such as ResNet, VGG, and Transformer, shows up to 11x improvement in application runtime and up to 8.3x reduction in the data transferred from the storage to the compute tier compared to running the computation entirely in the compute tier.
    Exoplanet atmosphere evolution: emulation with neural networks. (arXiv:2110.15162v3 [astro-ph.EP] UPDATED)
    Atmospheric mass-loss is known to play a leading role in sculpting the demographics of small, close-in exoplanets. Knowledge of how such planets evolve allows one to ``rewind the clock'' to infer the conditions in which they formed. Here, we explore the relationship between a planet's core mass and their atmospheric mass after protoplanetary disc dispersal by exploiting XUV photoevaporation as an evolutionary process. Historically, this style of inference problem would be computationally infeasible due to the large number of planet models required; however, we make use of a novel atmospheric evolution emulator which utilises neural networks to provide three orders of magnitude in speedup. First, we provide proof-of-concept for this emulator on a real problem, by inferring the initial atmospheric conditions to the TOI-270 multi-planet system. Using the emulator we find near-indistinguishable results when compared to original model. We then apply the emulator to the more complex inference problem, which aims to find the initial conditions for a sample of \textit{Kepler}, \textit{K2} and \textit{TESS} planets with well-constrained masses and radii. We demonstrate there is a relationship between core masses and the atmospheric mass that they retain after disc dispersal, and this trend is consistent with the `boil-off' scenario, in which close-in planets undergo dramatic atmospheric escape during disc dispersal. Thus, it appears the exoplanet population is consistent with the idea that close-in exoplanets initially acquired large massive atmospheres, the majority of which is lost during disc dispersal; before the final population is sculpted by atmospheric loss over 100~Myr to Gyr timescales.
    GARNET: Reduced-Rank Topology Learning for Robust and Scalable Graph Neural Networks. (arXiv:2201.12741v6 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have been increasingly deployed in various applications that involve learning on non-Euclidean data. However, recent studies show that GNNs are vulnerable to graph adversarial attacks. Although there are several defense methods to improve GNN robustness by eliminating adversarial components, they may also impair the underlying clean graph structure that contributes to GNN training. In addition, few of those defense models can scale to large graphs due to their high computational complexity and memory usage. In this paper, we propose GARNET, a scalable spectral method to boost the adversarial robustness of GNN models. GARNET first leverages weighted spectral embedding to construct a base graph, which is not only resistant to adversarial attacks but also contains critical (clean) graph structure for GNN training. Next, GARNET further refines the base graph by pruning additional uncritical edges based on probabilistic graphical model. GARNET has been evaluated on various datasets, including a large graph with millions of nodes. Our extensive experiment results show that GARNET achieves adversarial accuracy improvement and runtime speedup over state-of-the-art GNN (defense) models by up to 13.27% and 14.7x, respectively.
    Verifying Learning-Based Robotic Navigation Systems. (arXiv:2205.13536v2 [cs.RO] UPDATED)
    Deep reinforcement learning (DRL) has become a dominant deep-learning paradigm for tasks where complex policies are learned within reactive systems. Unfortunately, these policies are known to be susceptible to bugs. Despite significant progress in DNN verification, there has been little work demonstrating the use of modern verification tools on real-world, DRL-controlled systems. In this case study, we attempt to begin bridging this gap, and focus on the important task of mapless robotic navigation -- a classic robotics problem, in which a robot, usually controlled by a DRL agent, needs to efficiently and safely navigate through an unknown arena towards a target. We demonstrate how modern verification engines can be used for effective model selection, i.e., selecting the best available policy for the robot in question from a pool of candidate policies. Specifically, we use verification to detect and rule out policies that may demonstrate suboptimal behavior, such as collisions and infinite loops. We also apply verification to identify models with overly conservative behavior, thus allowing users to choose superior policies, which might be better at finding shorter paths to a target. To validate our work, we conducted extensive experiments on an actual robot, and confirmed that the suboptimal policies detected by our method were indeed flawed. We also demonstrate the superiority of our verification-driven approach over state-of-the-art, gradient attacks. Our work is the first to establish the usefulness of DNN verification in identifying and filtering out suboptimal DRL policies in real-world robots, and we believe that the methods presented here are applicable to a wide range of systems that incorporate deep-learning-based agents.
    OpenCon: Open-world Contrastive Learning. (arXiv:2208.02764v2 [cs.LG] UPDATED)
    Machine learning models deployed in the wild naturally encounter unlabeled samples from both known and novel classes. Challenges arise in learning from both the labeled and unlabeled data, in an open-world semi-supervised manner. In this paper, we introduce a new learning framework, open-world contrastive learning (OpenCon). OpenCon tackles the challenges of learning compact representations for both known and novel classes and facilitates novelty discovery along the way. We demonstrate the effectiveness of OpenCon on challenging benchmark datasets and establish competitive performance. On the ImageNet dataset, OpenCon significantly outperforms the current best method by 11.9% and 7.4% on novel and overall classification accuracy, respectively. Theoretically, OpenCon can be rigorously interpreted from an EM algorithm perspective--minimizing our contrastive loss partially maximizes the likelihood by clustering similar samples in the embedding space. The code is available at https://github.com/deeplearning-wisc/opencon.
    Unraveling the graph structure of tabular data through Bayesian and spectral analysis. (arXiv:2110.01421v2 [cs.LG] UPDATED)
    In the big-data age, tabular data are being generated and analyzed everywhere. As a consequence, finding and understanding the relationships between the features in these data are of great relevance. Here, to encompass these relationships, we propose a graph-based method that allows individual, group and multi-scale analyses. The method starts by mapping the tabular data into a weighted directed graph using the Shapley additive explanations technique. With this graph of relationships, we show that the inference of the hierarchical modular structure obtained by the Nested Stochastic Block Model (nSBM) as well as the study of the spectral space of the magnetic Laplacian can help us identify the classes of features and unravel non-trivial relationships. As a case study, we analyzed a socioeconomic survey conducted with students in Brazil: the PeNSE survey. The spectral embedding of the columns suggested that questions related to physical activities form a separate group. The application of the nSBM approach not only corroborated with that but allowed complementary findings about the modular structure: some groups of questions showed a high adherence with the divisions qualitatively defined by the designers of the survey. As opposed to the structure obtained by the spectrum, questions from the class Safety were partly grouped by our method in the class Drugs. Surprisingly, by inspecting these questions, we observed that they were related to both these topics, suggesting an alternative interpretation of these questions. These results show how our method can provide guidance for tabular data analysis as well as the design of future surveys.
    Investigations on convergence behaviour of Physics Informed Neural Networks across spectral ranges and derivative orders. (arXiv:2301.02790v1 [cs.LG])
    An important inference from Neural Tangent Kernel (NTK) theory is the existence of spectral bias (SB), that is, low frequency components of the target function of a fully connected Artificial Neural Network (ANN) being learnt significantly faster than the higher frequencies during training. This is established for Mean Square Error (MSE) loss functions with very low learning rate parameters. Physics Informed Neural Networks (PINNs) are designed to learn the solutions of differential equations (DE) of arbitrary orders; in PINNs the loss functions are obtained as the residues of the conservative form of the DEs and represent the degree of dissatisfaction of the equations. So there has been an open question whether (a) PINNs also exhibit SB and (b) if so, how does this bias vary across the orders of the DEs. In this work, a series of numerical experiments are conducted on simple sinusoidal functions of varying frequencies, compositions and equation orders to investigate these issues. It is firmly established that under normalized conditions, PINNs do exhibit strong spectral bias, and this increases with the order of the differential equation.
    Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling. (arXiv:2301.03580v1 [cs.CV])
    We identify and overcome two key obstacles in extending the success of BERT-style pre-training, or the masked image modeling, to convolutional networks (convnets): (i) convolution operation cannot handle irregular, random-masked input images; (ii) the single-scale nature of BERT pre-training is inconsistent with convnet's hierarchical structure. For (i), we treat unmasked pixels as sparse voxels of 3D point clouds and use sparse convolution to encode. This is the first use of sparse convolution for 2D masked modeling. For (ii), we develop a hierarchical decoder to reconstruct images from multi-scale encoded features. Our method called Sparse masKed modeling (SparK) is general: it can be used directly on any convolutional model without backbone modifications. We validate it on both classical (ResNet) and modern (ConvNeXt) models: on three downstream tasks, it surpasses both state-of-the-art contrastive learning and transformer-based masked modeling by similarly large margins (around +1.0%). Improvements on object detection and instance segmentation are more substantial (up to +3.5%), verifying the strong transferability of features learned. We also find its favorable scaling behavior by observing more gains on larger models. All this evidence reveals a promising future of generative pre-training on convnets. Codes and models are released at https://github.com/keyu-tian/SparK.
    Exponential Family Model-Based Reinforcement Learning via Score Matching. (arXiv:2112.14195v2 [cs.LG] UPDATED)
    We propose an optimistic model-based algorithm, dubbed SMRL, for finite-horizon episodic reinforcement learning (RL) when the transition model is specified by exponential family distributions with $d$ parameters and the reward is bounded and known. SMRL uses score matching, an unnormalized density estimation technique that enables efficient estimation of the model parameter by ridge regression. Under standard regularity assumptions, SMRL achieves $\tilde O(d\sqrt{H^3T})$ online regret, where $H$ is the length of each episode and $T$ is the total number of interactions (ignoring polynomial dependence on structural scale parameters).
    Neuromorphic Wireless Cognition: Event-Driven Semantic Communications for Remote Inference. (arXiv:2206.06047v2 [cs.IT] UPDATED)
    Neuromorphic computing is an emerging computing paradigm that moves away from batched processing towards the online, event-driven, processing of streaming data. Neuromorphic chips, when coupled with spike-based sensors, can inherently adapt to the "semantics" of the data distribution by consuming energy only when relevant events are recorded in the timing of spikes and by proving a low-latency response to changing conditions in the environment. This paper proposes an end-to-end design for a neuromorphic wireless Internet-of-Things system that integrates spike-based sensing, processing, and communication. In the proposed NeuroComm system, each sensing device is equipped with a neuromorphic sensor, a spiking neural network (SNN), and an impulse radio transmitter with multiple antennas. Transmission takes place over a shared fading channel to a receiver equipped with a multi-antenna impulse radio receiver and with an SNN. In order to enable adaptation of the receiver to the fading channel conditions, we introduce a hypernetwork to control the weights of the decoding SNN using pilots. Pilots, encoding SNNs, decoding SNN, and hypernetwork are jointly trained across multiple channel realizations. The proposed system is shown to significantly improve over conventional frame-based digital solutions, as well as over alternative non-adaptive training methods, in terms of time-to-accuracy and energy consumption metrics.
    IAN: Iterated Adaptive Neighborhoods for manifold learning and dimensionality estimation. (arXiv:2208.09123v3 [cs.LG] UPDATED)
    Invoking the manifold assumption in machine learning requires knowledge of the manifold's geometry and dimension, and theory dictates how many samples are required. However, in applications data are limited, sampling may not be uniform, and manifold properties are unknown and (possibly) non-pure; this implies that neighborhoods must adapt to the local structure. We introduce an algorithm for inferring adaptive neighborhoods for data given by a similarity kernel. Starting with a locally-conservative neighborhood (Gabriel) graph, we sparsify it iteratively according to a weighted counterpart. In each step, a linear program yields minimal neighborhoods globally and a volumetric statistic reveals neighbor outliers likely to violate manifold geometry. We apply our adaptive neighborhoods to non-linear dimensionality reduction, geodesic computation and dimension estimation. A comparison against standard algorithms using, e.g., k-nearest neighbors, demonstrates their usefulness. Code for our algorithm will be available at https://github.com/dyballa/IAN
    Reservoir Prediction by Machine Learning Methods on The Well Data and Seismic Attributes for Complex Coastal Conditions. (arXiv:2301.03216v1 [physics.geo-ph])
    The aim of this work was to predict the probability of the spread of rock formations with hydrocarbon-collecting properties in the studied coastal area using a stack of machine learning algorithms and data augmentation and modification methods. This research develops the direction of machine learning where training is conducted on well data and spatial attributes. Methods for overcoming the limitations of this direction are shown, two methods - augmentation and modification of the well data sample: Spindle and Revers-Calibration. Considering the difficulties for seismic data interpretation in coastal area conditions, the proposed approach is a tool which is able to work with the whole totality of geological and geophysical data, extract the knowledge from 159-dimensional space spatial attributes and make facies spreading prediction with acceptable quality - F1 measure for reservoir class 0.798 on average for evaluation of "drilling" results of different geological conditions. It was shown that consistent application of the proposed augmentation methods in the implemented technology stack improves the quality of reservoir prediction by a factor of 1.56 relative to the original dataset.
    Topologically Regularized Data Embeddings. (arXiv:2301.03338v1 [cs.LG])
    Unsupervised representation learning methods are widely used for gaining insight into high-dimensional, unstructured, or structured data. In some cases, users may have prior topological knowledge about the data, such as a known cluster structure or the fact that the data is known to lie along a tree- or graph-structured topology. However, generic methods to ensure such structure is salient in the low-dimensional representations are lacking. This negatively impacts the interpretability of low-dimensional embeddings, and plausibly downstream learning tasks. To address this issue, we introduce topological regularization: a generic approach based on algebraic topology to incorporate topological prior knowledge into low-dimensional embeddings. We introduce a class of topological loss functions, and show that jointly optimizing an embedding loss with such a topological loss function as a regularizer yields embeddings that reflect not only local proximities but also the desired topological structure. We include a self-contained overview of the required foundational concepts in algebraic topology, and provide intuitive guidance on how to design topological loss functions for a variety of shapes, such as clusters, cycles, and bifurcations. We empirically evaluate the proposed approach on computational efficiency, robustness, and versatility in combination with linear and non-linear dimensionality reduction and graph embedding methods.
    UB3: Best Beam Identification in Millimeter Wave Systems via Pure Exploration Unimodal Bandits. (arXiv:2301.03456v1 [eess.SP])
    Millimeter wave (mmWave) communications have a broad spectrum and can support data rates in the order of gigabits per second, as envisioned in 5G systems. However, they cannot be used for long distances due to their sensitivity to attenuation loss. To enable their use in the 5G network, it requires that the transmission energy be focused in sharp pencil beams. As any misalignment between the transmitter and receiver beam pair can reduce the data rate significantly, it is important that they are aligned as much as possible. To find the best transmit-receive beam pair, recent beam alignment (BA) techniques examine the entire beam space, which might result in a large amount of BA latency. Recent works propose to adaptively select the beams such that the cumulative reward measured in terms of received signal strength or throughput is maximized. In this paper, we develop an algorithm that exploits the unimodal structure of the received signal strengths of the beams to identify the best beam in a finite time using pure exploration strategies. Strategies that identify the best beam in a fixed time slot are more suitable for wireless network protocol design than cumulative reward maximization strategies that continuously perform exploration and exploitation. Our algorithm is named Unimodal Bandit for Best Beam (UB3) and identifies the best beam with a high probability in a few rounds. We prove that the error exponent in the probability does not depend on the number of beams and show that this is indeed the case by establishing a lower bound for the unimodal bandits. We demonstrate that UB3 outperforms the state-of-the-art algorithms through extensive simulations. Moreover, our algorithm is simple to implement and has lower computational complexity.
    A Domain-Theoretic Framework for Robustness Analysis of Neural Networks. (arXiv:2203.00295v3 [cs.LG] UPDATED)
    A domain-theoretic framework is presented for validated robustness analysis of neural networks. First, global robustness of a general class of networks is analyzed. Then, using the fact that Edalat's domain-theoretic L-derivative coincides with Clarke's generalized gradient, the framework is extended for attack-agnostic local robustness analysis. The proposed framework is ideal for designing algorithms which are correct by construction. This claim is exemplified by developing a validated algorithm for estimation of Lipschitz constant of feedforward regressors. The completeness of the algorithm is proved over differentiable networks, and also over general position ReLU networks. Computability results are obtained within the framework of effectively given domains. Using the proposed domain model, differentiable and non-differentiable networks can be analyzed uniformly. The validated algorithm is implemented using arbitrary-precision interval arithmetic, and the results of some experiments are presented. The software implementation is truly validated, as it handles floating-point errors as well.
    Convergence of Stochastic Approximation via Martingale and Converse Lyapunov Methods. (arXiv:2205.01303v3 [stat.ML] UPDATED)
    In this paper, we study the almost sure boundedness and the convergence of the stochastic approximation (SA) algorithm. At present, most available convergence proofs are based on the ODE method, and the almost sure boundedness of the iterations is an assumption and not a conclusion. In Borkar-Meyn (2000), it is shown that if the ODE has only one globally attractive equilibrium, then under additional assumptions, the iterations are bounded almost surely, and the SA algorithm converges to the desired solution. Our objective in the present paper is to provide an alternate proof of the above, based on martingale methods, which are simpler and less technical than those based on the ODE method. As a prelude, we prove a new sufficient condition for the global asymptotic stability of an ODE. Next we prove a "converse" Lyapunov theorem on the existence of a suitable Lyapunov function with a globally bounded Hessian, for a globally exponentially stable system. Both theorems are of independent interest to researchers in stability theory. Then, using these results, we provide sufficient conditions for the almost sure boundedness and the convergence of the SA algorithm. We show through examples that our theory covers some situations that are not covered by currently known results, specifically Borkar-Meyn (2000).
    Reinforcement Learning for Joint Optimization of Multiple Rewards. (arXiv:1909.02940v4 [cs.LG] UPDATED)
    Finding optimal policies which maximize long term rewards of Markov Decision Processes requires the use of dynamic programming and backward induction to solve the Bellman optimality equation. However, many real-world problems require optimization of an objective that is non-linear in cumulative rewards for which dynamic programming cannot be applied directly. For example, in a resource allocation problem, one of the objectives is to maximize long-term fairness among the users. We notice that when an agent aim to optimize some function of the sum of rewards is considered, the problem loses its Markov nature. This paper addresses and formalizes the problem of optimizing a non-linear function of the long term average of rewards. We propose model-based and model-free algorithms to learn the policy, where the model-based policy is shown to achieve a regret of $\Tilde{O}\left(LKDS\sqrt{\frac{A}{T}}\right)$ for $K$ objectives combined with a concave $L$-Lipschitz function. Further, using the fairness in cellular base-station scheduling, and queueing system scheduling as examples, the proposed algorithm is shown to significantly outperform the conventional RL approaches.
    Hierarchical Federated Learning with Quantization: Convergence Analysis and System Design. (arXiv:2103.14272v2 [cs.LG] UPDATED)
    Federated learning (FL) is a powerful distributed machine learning framework where a server aggregates models trained by different clients without accessing their private data. Hierarchical FL, with a client-edge-cloud aggregation hierarchy, can effectively leverage both the cloud server's access to many clients' data and the edge servers' closeness to the clients to achieve a high communication efficiency. Neural network quantization can further reduce the communication overhead during model uploading. To fully exploit the advantages of hierarchical FL, an accurate convergence analysis with respect to the key system parameters is needed. Unfortunately, existing analysis is loose and does not consider model quantization. In this paper, we derive a tighter convergence bound for hierarchical FL with quantization. The convergence result leads to practical guidelines for important design problems such as the client-edge aggregation and edge-client association strategies. Based on the obtained analytical results, we optimize the two aggregation intervals and show that the client-edge aggregation interval should slowly decay while the edge-cloud aggregation interval needs to adapt to the ratio of the client-edge and edge-cloud propagation delay. Simulation results shall verify the design guidelines and demonstrate the effectiveness of the proposed aggregation strategy.
    Asymptotic Bounds for Smoothness Parameter Estimates in Gaussian Process Interpolation. (arXiv:2203.05400v3 [math.ST] UPDATED)
    It is common to model a deterministic response function, such as the output of a computer experiment, as a Gaussian process with a Mat\'ern covariance kernel. The smoothness parameter of a Mat\'ern kernel determines many important properties of the model in the large data limit, including the rate of convergence of the conditional mean to the response function. We prove that the maximum likelihood estimate of the smoothness parameter cannot asymptotically undersmooth the truth when the data are obtained on a fixed bounded subset of $\mathbb{R}^d$. That is, if the data-generating response function has Sobolev smoothness $\nu_0 + d/2$, then the smoothness parameter estimate cannot be asymptotically less than $\nu_0 + d/2$. The lower bound is sharp. Additionally, we show that maximum likelihood estimation finds the "correct" smoothness for a class of compactly supported self-similar functions. We also consider cross-validation and prove an asymptotic lower bound $\nu_0$, which however is unlikely to be sharp. The results are based on approximation theory in Sobolev spaces and some general theorems that restrict the set of values that the parameter estimators can take.
    Self-mentoring: a new deep learning pipeline to train a self-supervised U-net for few-shot learning of bio-artificial capsule segmentation. (arXiv:2205.10840v3 [cs.CV] UPDATED)
    Background: Accurate segmentation of microscopic structures such as bio-artificial capsules in microscopy imaging is a prerequisite to the computer-aided understanding of important biomechanical phenomenons. State-of-the-art segmentation performances are achieved by deep neural networks and related data-driven approaches. Training these networks from only a few annotated examples is challenging while producing manually annotated images that provide supervision is tedious. Method: Recently, self-supervision, i.e. designing a neural pipeline providing synthetic or indirect supervision, has proved to significantly increase generalization performances of models trained on few shots. The objective of this paper is to introduce one such neural pipeline in the context of micro-capsule image segmentation. Our method leverages the rather simple content of these images so that a trainee network can be mentored by a referee network which has been previously trained on synthetically generated pairs of corrupted/correct region masks. Results: Challenging experimental setups are investigated. They involve from only 3 to 10 annotated images along with moderately large amounts of unannotated images. In a bio-artificial capsule dataset, our approach consistently and drastically improves accuracy. We also show that the learnt referee network is transferable to another Glioblastoma cell dataset and that it can be efficiently coupled with data augmentation strategies. Conclusions: Experimental results show that very significant accuracy increments are obtained by the proposed pipeline, leading to the conclusion that the self-supervision mechanism introduced in this paper has the potential to replace human annotations.
    Balance is Essence: Accelerating Sparse Training via Adaptive Gradient Correction. (arXiv:2301.03573v1 [cs.LG])
    Despite impressive performance on a wide variety of tasks, deep neural networks require significant memory and computation costs, prohibiting their application in resource-constrained scenarios. Sparse training is one of the most common techniques to reduce these costs, however, the sparsity constraints add difficulty to the optimization, resulting in an increase in training time and instability. In this work, we aim to overcome this problem and achieve space-time co-efficiency. To accelerate and stabilize the convergence of sparse training, we analyze the gradient changes and develop an adaptive gradient correction method. Specifically, we approximate the correlation between the current and previous gradients, which is used to balance the two gradients to obtain a corrected gradient. Our method can be used with most popular sparse training pipelines under both standard and adversarial setups. Theoretically, we prove that our method can accelerate the convergence rate of sparse training. Extensive experiments on multiple datasets, model architectures, and sparsities demonstrate that our method outperforms leading sparse training methods by up to \textbf{5.0\%} in accuracy given the same number of training epochs, and reduces the number of training epochs by up to \textbf{52.1\%} to achieve the same accuracy.
    A Comprehensive Taxonomy for Explainable Artificial Intelligence: A Systematic Survey of Surveys on Methods and Concepts. (arXiv:2105.07190v4 [cs.LG] UPDATED)
    In the meantime, a wide variety of terminologies, motivations, approaches, and evaluation criteria have been developed within the research field of explainable artificial intelligence (XAI). With the amount of XAI methods vastly growing, a taxonomy of methods is needed by researchers as well as practitioners: To grasp the breadth of the topic, compare methods, and to select the right XAI method based on traits required by a specific use-case context. Many taxonomies for XAI methods of varying level of detail and depth can be found in the literature. While they often have a different focus, they also exhibit many points of overlap. This paper unifies these efforts and provides a complete taxonomy of XAI methods with respect to notions present in the current state of research. In a structured literature analysis and meta-study, we identified and reviewed more than 50 of the most cited and current surveys on XAI methods, metrics, and method traits. After summarizing them in a survey of surveys, we merge terminologies and concepts of the articles into a unified structured taxonomy. Single concepts therein are illustrated by more than 50 diverse selected example methods in total, which we categorize accordingly. The taxonomy may serve both beginners, researchers, and practitioners as a reference and wide-ranging overview of XAI method traits and aspects. Hence, it provides foundations for targeted, use-case-oriented, and context-sensitive future research.
    Robust Feature-Level Adversaries are Interpretability Tools. (arXiv:2110.03605v6 [cs.LG] UPDATED)
    The literature on adversarial attacks in computer vision typically focuses on pixel-level perturbations. These tend to be very difficult to interpret. Recent work that manipulates the latent representations of image generators to create "feature-level" adversarial perturbations gives us an opportunity to explore perceptible, interpretable adversarial attacks. We make three contributions. First, we observe that feature-level attacks provide useful classes of inputs for studying representations in models. Second, we show that these adversaries are uniquely versatile and highly robust. We demonstrate that they can be used to produce targeted, universal, disguised, physically-realizable, and black-box attacks at the ImageNet scale. Third, we show how these adversarial images can be used as a practical interpretability tool for identifying bugs in networks. We use these adversaries to make predictions about spurious associations between features and classes which we then test by designing "copy/paste" attacks in which one natural image is pasted into another to cause a targeted misclassification. Our results suggest that feature-level attacks are a promising approach for rigorous interpretability research. They support the design of tools to better understand what a model has learned and diagnose brittle feature associations. Code is available at https://github.com/thestephencasper/feature_level_adv
    SpeeChain: A Speech Toolkit for Large-Scale Machine Speech Chain. (arXiv:2301.02966v1 [cs.CL])
    This paper introduces SpeeChain, an open-source Pytorch-based toolkit designed to develop the machine speech chain for large-scale use. This first release focuses on the TTS-to-ASR chain, a core component of the machine speech chain, that refers to the TTS data augmentation by unspoken text for ASR. To build an efficient pipeline for the large-scale TTS-to-ASR chain, we implement easy-to-use multi-GPU batch-level model inference, multi-dataloader batch generation, and on-the-fly data selection techniques. In this paper, we first explain the overall procedure of the TTS-to-ASR chain and the difficulties of each step. Then, we present a detailed ablation study on different types of unlabeled data, data filtering thresholds, batch composition, and real-synthetic data ratios. Our experimental results on train_clean_460 of LibriSpeech demonstrate that our TTS-to-ASR chain can significantly improve WER in a semi-supervised setting.
    Nuclear Segmentation and Classification: On Color & Compression Generalization. (arXiv:2301.03418v1 [eess.IV])
    Since the introduction of digital and computational pathology as a field, one of the major problems in the clinical application of algorithms has been the struggle to generalize well to examples outside the distribution of the training data. Existing work to address this in both pathology and natural images has focused almost exclusively on classification tasks. We explore and evaluate the robustness of the 7 best performing nuclear segmentation and classification models from the largest computational pathology challenge for this problem to date, the CoNIC challenge. We demonstrate that existing state-of-the-art (SoTA) models are robust towards compression artifacts but suffer substantial performance reduction when subjected to shifts in the color domain. We find that using stain normalization to address the domain shift problem can be detrimental to the model performance. On the other hand, neural style transfer is more consistent in improving test performance when presented with large color variations in the wild.
    Computationally Efficient Approximations for Matrix-based Renyi's Entropy. (arXiv:2112.13720v4 [stat.ML] UPDATED)
    The recently developed matrix based Renyi's entropy enables measurement of information in data simply using the eigenspectrum of symmetric positive semi definite (PSD) matrices in reproducing kernel Hilbert space, without estimation of the underlying data distribution. This intriguing property makes the new information measurement widely adopted in multiple statistical inference and learning tasks. However, the computation of such quantity involves the trace operator on a PSD matrix $G$ to power $\alpha$(i.e., $tr(G^\alpha)$), with a normal complexity of nearly $O(n^3)$, which severely hampers its practical usage when the number of samples (i.e., $n$) is large. In this work, we present computationally efficient approximations to this new entropy functional that can reduce its complexity to even significantly less than $O(n^2)$. To this end, we leverage the recent progress on Randomized Numerical Linear Algebra, developing Taylor, Chebyshev and Lanczos approximations to $tr(G^\alpha)$ for arbitrary values of $\alpha$ by converting it into matrix-vector multiplications problem. We also establish the connection between the matrix-based Renyi's entropy and PSD matrix approximation, which enables exploiting both clustering and block low-rank structure of $G$ to further reduce the computational cost. We theoretically provide approximation accuracy guarantees and illustrate the properties of different approximations. Large-scale experimental evaluations on both synthetic and real-world data corroborate our theoretical findings, showing promising speedup with negligible loss in accuracy.
    Diffusion models as plug-and-play priors. (arXiv:2206.09012v3 [cs.LG] UPDATED)
    We consider the problem of inferring high-dimensional data $\mathbf{x}$ in a model that consists of a prior $p(\mathbf{x})$ and an auxiliary differentiable constraint $c(\mathbf{x},\mathbf{y})$ on $x$ given some additional information $\mathbf{y}$. In this paper, the prior is an independently trained denoising diffusion generative model. The auxiliary constraint is expected to have a differentiable form, but can come from diverse sources. The possibility of such inference turns diffusion models into plug-and-play modules, thereby allowing a range of potential applications in adapting models to new domains and tasks, such as conditional generation or image segmentation. The structure of diffusion models allows us to perform approximate inference by iterating differentiation through the fixed denoising network enriched with different amounts of noise at each step. Considering many noised versions of $\mathbf{x}$ in evaluation of its fitness is a novel search mechanism that may lead to new algorithms for solving combinatorial optimization problems.
    Automatic Differentiation of Programs with Discrete Randomness. (arXiv:2210.08572v3 [cs.LG] UPDATED)
    Automatic differentiation (AD), a technique for constructing new programs which compute the derivative of an original program, has become ubiquitous throughout scientific computing and deep learning due to the improved performance afforded by gradient-based optimization. However, AD systems have been restricted to the subset of programs that have a continuous dependence on parameters. Programs that have discrete stochastic behaviors governed by distribution parameters, such as flipping a coin with probability $p$ of being heads, pose a challenge to these systems because the connection between the result (heads vs tails) and the parameters ($p$) is fundamentally discrete. In this paper we develop a new reparameterization-based methodology that allows for generating programs whose expectation is the derivative of the expectation of the original program. We showcase how this method gives an unbiased and low-variance estimator which is as automated as traditional AD mechanisms. We demonstrate unbiased forward-mode AD of discrete-time Markov chains, agent-based models such as Conway's Game of Life, and unbiased reverse-mode AD of a particle filter. Our code package is available at https://github.com/gaurav-arya/StochasticAD.jl.
    ExcelFormer: A Neural Network Surpassing GBDTs on Tabular Data. (arXiv:2301.02819v1 [cs.LG])
    Though neural networks have achieved enormous breakthroughs on various fields (e.g., computer vision) in supervised learning, they still trailed the performances of GBDTs on tabular data thus far. Delving into this issue, we identify that a proper handling of feature interactions and feature embedding is crucial to the success of neural networks on tabular data. We develop a novel neural network called ExcelFormer, which alternates in turn two attention modules that respectively manipulate careful feature interactions and feature embedding updates. A bespoke training methodology is jointly introduced to facilitate the model performances. By initializing parameters with minuscule values, these attention modules are attenuated when the training begins, and the effects of feature interactions and embedding updates progressively grow up to optimum levels under the guidance of the proposed specific regularization approaches Swap-Mix and Hidden-Mix as the training proceeds. Experiments on 25 public tabular datasets show that our ExcelFormer is superior to extremely-tuned GBDTs, which is an unprecedented achievement of neural networks in supervised tabular learning.
    Learning Symbolic Representations for Reinforcement Learning of Non-Markovian Behavior. (arXiv:2301.02952v1 [cs.LG])
    Many real-world reinforcement learning (RL) problems necessitate learning complex, temporally extended behavior that may only receive reward signal when the behavior is completed. If the reward-worthy behavior is known, it can be specified in terms of a non-Markovian reward function - a function that depends on aspects of the state-action history, rather than just the current state and action. Such reward functions yield sparse rewards, necessitating an inordinate number of experiences to find a policy that captures the reward-worthy pattern of behavior. Recent work has leveraged Knowledge Representation (KR) to provide a symbolic abstraction of aspects of the state that summarize reward-relevant properties of the state-action history and support learning a Markovian decomposition of the problem in terms of an automaton over the KR. Providing such a decomposition has been shown to vastly improve learning rates, especially when coupled with algorithms that exploit automaton structure. Nevertheless, such techniques rely on a priori knowledge of the KR. In this work, we explore how to automatically discover useful state abstractions that support learning automata over the state-action history. The result is an end-to-end algorithm that can learn optimal policies with significantly fewer environment samples than state-of-the-art RL on simple non-Markovian domains.
    An open unified deep graph learning framework for discovering drug leads. (arXiv:2301.03424v1 [q-bio.BM])
    Computational discovery of ideal lead compounds is a critical process for modern drug discovery. It comprises multiple stages: hit screening, molecular property prediction, and molecule optimization. Current efforts are disparate, involving the establishment of models for each stage, followed by multi-stage multi-model integration. However, this is non-ideal, as clumsy integration of incompatible models increases research overheads, and may even reduce success rates in drug discovery. Facilitating compatibilities requires establishing inherent model consistencies across lead discovery stages. Towards that effect, we propose an open deep graph learning (DGL) based pipeline: generative adversarial feature subspace enhancement (GAFSE), which first unifies the modeling of these stages into one learning framework. GAFSE also offers standardized modular design and streamlined interfaces for future expansions and community support. GAFSE combines adversarial/generative learning, graph attention network, graph reconstruction network, and optimizes the classification/regression loss, adversarial/generative loss, and reconstruction loss simultaneously. Convergence analysis theoretically guarantees model generalization performance. Exhaustive benchmarking demonstrates that the GAFSE pipeline achieves excellent performance across almost all lead discovery stages, while also providing valuable model interpretability. Hence, we believe this tool will enhance the efficiency and productivity of drug discovery researchers.
    Generalized Kernel Regularized Least Squares. (arXiv:2209.14355v2 [stat.ML] UPDATED)
    Kernel Regularized Least Squares (KRLS) is a popular method for flexibly estimating models that may have complex relationships between variables. However, its usefulness to many researchers is limited for two reasons. First, existing approaches are inflexible and do not allow KRLS to be combined with theoretically-motivated extensions such as random effects, unregularized fixed effects, or non-Gaussian outcomes. Second, estimation is extremely computationally intensive for even modestly sized datasets. Our paper addresses both concerns by introducing generalized KRLS (gKRLS). We note that KRLS can be re-formulated as a hierarchical model thereby allowing easy inference and modular model construction where KRLS can be used alongside random effects, splines, and unregularized fixed effects. Computationally, we also implement random sketching to dramatically accelerate estimation while incurring a limited penalty in estimation quality. We demonstrate that gKRLS can be fit on datasets with tens of thousands of observations in under one minute. Further, state-of-the-art techniques that require fitting the model over a dozen times (e.g. meta-learners) can be estimated quickly.
    Wasserstein Iterative Networks for Barycenter Estimation. (arXiv:2201.12245v2 [cs.LG] UPDATED)
    Wasserstein barycenters have become popular due to their ability to represent the average of probability measures in a geometrically meaningful way. In this paper, we present an algorithm to approximate the Wasserstein-2 barycenters of continuous measures via a generative model. Previous approaches rely on regularization (entropic/quadratic) which introduces bias or on input convex neural networks which are not expressive enough for large-scale tasks. In contrast, our algorithm does not introduce bias and allows using arbitrary neural networks. In addition, based on the celebrity faces dataset, we construct Ave, celeba! dataset which can be used for quantitative evaluation of barycenter algorithms by using standard metrics of generative models such as FID.
    A Classification of $G$-invariant Shallow Neural Networks. (arXiv:2205.09219v5 [cs.LG] UPDATED)
    When trying to fit a deep neural network (DNN) to a $G$-invariant target function with $G$ a group, it only makes sense to constrain the DNN to be $G$-invariant as well. However, there can be many different ways to do this, thus raising the problem of ``$G$-invariant neural architecture design'': What is the optimal $G$-invariant architecture for a given problem? Before we can consider the optimization problem itself, we must understand the search space, the architectures in it, and how they relate to one another. In this paper, we take a first step towards this goal; we prove a theorem that gives a classification of all $G$-invariant single-hidden-layer or ``shallow'' neural network ($G$-SNN) architectures with ReLU activation for any finite orthogonal group $G$, and we prove a second theorem that characterizes the inclusion maps or ``network morphisms'' between the architectures that can be leveraged during neural architecture search (NAS). The proof is based on a correspondence of every $G$-SNN to a signed permutation representation of $G$ acting on the hidden neurons; the classification is equivalently given in terms of the first cohomology classes of $G$, thus admitting a topological interpretation. The $G$-SNN architectures corresponding to nontrivial cohomology classes have, to our knowledge, never been explicitly identified in the literature previously. Using a code implementation, we enumerate the $G$-SNN architectures for some example groups $G$ and visualize their structure. Finally, we prove that architectures corresponding to inequivalent cohomology classes coincide in function space only when their weight matrices are zero, and we discuss the implications of this for NAS.
    Deep Injective Prior for Inverse Scattering. (arXiv:2301.03092v1 [cs.LG])
    In electromagnetic inverse scattering, we aim to reconstruct object permittivity from scattered waves. Deep learning is a promising alternative to traditional iterative solvers, but it has been used mostly in a supervised framework to regress the permittivity patterns from scattered fields or back-projections. While such methods are fast at test-time and achieve good results for specific data distributions, they are sensitive to the distribution drift of the scattered fields, common in practice. If the distribution of the scattered fields changes due to changes in frequency, the number of transmitters and receivers, or any other real-world factor, an end-to-end neural network must be re-trained or fine-tuned on a new dataset. In this paper, we propose a new data-driven framework for inverse scattering based on deep generative models. We model the target permittivities by a low-dimensional manifold which acts as a regularizer and learned from data. Unlike supervised methods which require both scattered fields and target signals, we only need the target permittivities for training; it can then be used with any experimental setup. We show that the proposed framework significantly outperforms the traditional iterative methods especially for strong scatterers while having comparable reconstruction quality to state-of-the-art deep learning methods like U-Net.
    Joint Liver and Hepatic Lesion Segmentation in MRI using a Hybrid CNN with Transformer Layers. (arXiv:2201.10981v2 [eess.IV] UPDATED)
    Deep learning-based segmentation of the liver and hepatic lesions therein steadily gains relevance in clinical practice due to the increasing incidence of liver cancer each year. Whereas various network variants with overall promising results in the field of medical image segmentation have been successfully developed over the last years, almost all of them struggle with the challenge of accurately segmenting hepatic lesions in magnetic resonance imaging (MRI). This led to the idea of combining elements of convolutional and transformer-based architectures to overcome the existing limitations. This work presents a hybrid network called SWTR-Unet, consisting of a pretrained ResNet, transformer blocks as well as a common Unet-style decoder path. This network was primarily applied to single-modality non-contrast-enhanced liver MRI and additionally to the publicly available computed tomography (CT) data of the liver tumor segmentation (LiTS) challenge to verify the applicability on other modalities. For a broader evaluation, multiple state-of-the-art networks were implemented and applied, ensuring a direct comparability. Furthermore, correlation analysis and an ablation study were carried out, to investigate various influencing factors on the segmentation accuracy of the presented method. With Dice scores of averaged 98+-2% for liver and 81+-28% lesion segmentation on the MRI dataset and 97+-2% and 79+-25%, respectively on the CT dataset, the proposed SWTR-Unet proved to be a precise approach for liver and hepatic lesion segmentation with state-of-the-art results for MRI and competing accuracy in CT imaging. The achieved segmentation accuracy was found to be on par with manually performed expert segmentations as indicated by inter-observer variabilities for liver lesion segmentation. In conclusion, the presented method could save valuable time and resources in clinical practice.
    Deepfake CAPTCHA: A Method for Preventing Fake Calls. (arXiv:2301.03064v1 [cs.CR])
    Deep learning technology has made it possible to generate realistic content of specific individuals. These `deepfakes' can now be generated in real-time which enables attackers to impersonate people over audio and video calls. Moreover, some methods only need a few images or seconds of audio to steal an identity. Existing defenses perform passive analysis to detect fake content. However, with the rapid progress of deepfake quality, this may be a losing game. In this paper, we propose D-CAPTCHA: an active defense against real-time deepfakes. The approach is to force the adversary into the spotlight by challenging the deepfake model to generate content which exceeds its capabilities. By doing so, passive detection becomes easier since the content will be distorted. In contrast to existing CAPTCHAs, we challenge the AI's ability to create content as opposed to its ability to classify content. In this work we focus on real-time audio deepfakes and present preliminary results on video. In our evaluation we found that D-CAPTCHA outperforms state-of-the-art audio deepfake detectors with an accuracy of 91-100% depending on the challenge (compared to 71% without challenges). We also performed a study on 41 volunteers to understand how threatening current real-time deepfake attacks are. We found that the majority of the volunteers could not tell the difference between real and fake audio.
    Upward lightning at wind turbines: Risk assessment from larger-scale meteorology. (arXiv:2301.03360v1 [stat.ML])
    Upward lightning (UL) has become an increasingly important threat to wind turbines as ever more of them are being installed for renewably producing electricity. The taller the wind turbine the higher the risk that the type of lightning striking the man-made structure is UL. UL can be much more destructive than downward lightning due to its long lasting initial continuous current leading to a large charge transfer within the lightning discharge process. Current standards for the risk assessment of lightning at wind turbines mainly take the summer lightning activity into account, which is inferred from LLS. Ground truth lightning current measurements reveal that less than 50% of UL might be detected by lightning location systems (LLS). This leads to a large underestimation of the proportion of LLS-non-detectable UL at wind turbines, which is the dominant lightning type in the cold season. This study aims to assess the risk of LLS-detectable and LLS-non-detectable UL at wind turbines using direct UL measurements at the Gaisberg Tower (Austria) and S\"antis Tower (Switzerland). Direct UL observations are linked to meteorological reanalysis data and joined by random forests, a powerful machine learning technique. The meteorological drivers for the non-/occurrence of LLS-detectable and LLS-non-detectable UL, respectively, are found from the random forest models trained at the towers and have large predictive skill on independent data. In a second step the results from the tower-trained models are extended to a larger study domain (Central and Northern Germany). The tower-trained models for LLS-detectable lightning is independently verified at wind turbine locations in that domain and found to reliably diagnose that type of UL. Risk maps based on case study events show that high diagnosed probabilities in the study domain coincide with actual UL events.
    $\mathcal{Y}$-Tuning: An Efficient Tuning Paradigm for Large-Scale Pre-Trained Models via Label Representation Learning. (arXiv:2202.09817v2 [cs.CL] UPDATED)
    With the success of large-scale pre-trained models (PTMs), how efficiently adapting PTMs to downstream tasks has attracted tremendous attention, especially for PTMs with billions of parameters. Although some parameter-efficient tuning paradigms have been proposed to address this problem, they still require large resources to compute the gradients in the training phase. In this paper, we propose $\mathcal{Y}$-Tuning, an efficient yet effective paradigm to adapt frozen large-scale PTMs to specific downstream tasks. $\mathcal{Y}$-tuning learns dense representations for labels $\mathcal{Y}$ defined in a given task and aligns them to fixed feature representation. Without tuning the features of input text and model parameters, $\mathcal{Y}$-tuning is both parameter-efficient and training-efficient. For $\text{DeBERTa}_\text{XXL}$ with 1.6 billion parameters, $\mathcal{Y}$-tuning achieves performance more than $96\%$ of full fine-tuning on GLUE Benchmark with only $2\%$ tunable parameters and much fewer training costs.
    Law Informs Code: A Legal Informatics Approach to Aligning Artificial Intelligence with Humans. (arXiv:2209.13020v12 [cs.CY] UPDATED)
    We are currently unable to specify human goals and societal values in a way that reliably directs AI behavior. Law-making and legal interpretation form a computational engine that converts opaque human values into legible directives. "Law Informs Code" is the research agenda embedding legal knowledge and reasoning in AI. Similar to how parties to a legal contract cannot foresee every potential contingency of their future relationship, and legislators cannot predict all the circumstances under which their proposed bills will be applied, we cannot ex ante specify rules that provably direct good AI behavior. Legal theory and practice have developed arrays of tools to address these specification problems. For instance, legal standards allow humans to develop shared understandings and adapt them to novel situations. In contrast to more prosaic uses of the law (e.g., as a deterrent of bad behavior through the threat of sanction), leveraged as an expression of how humans communicate their goals, and what society values, Law Informs Code. We describe how data generated by legal processes (methods of law-making, statutory interpretation, contract drafting, applications of legal standards, legal reasoning, etc.) can facilitate the robust specification of inherently vague human goals. This increases human-AI alignment and the local usefulness of AI. Toward society-AI alignment, we present a framework for understanding law as the applied philosophy of multi-agent alignment. Although law is partly a reflection of historically contingent political power - and thus not a perfect aggregation of citizen preferences - if properly parsed, its distillation offers the most legitimate computational comprehension of societal values available. If law eventually informs powerful AI, engaging in the deliberative political process to improve law takes on even more meaning.
    So3krates: Equivariant attention for interactions on arbitrary length-scales in molecular systems. (arXiv:2205.14276v3 [cs.LG] UPDATED)
    The application of machine learning methods in quantum chemistry has enabled the study of numerous chemical phenomena, which are computationally intractable with traditional ab-initio methods. However, some quantum mechanical properties of molecules and materials depend on non-local electronic effects, which are often neglected due to the difficulty of modeling them efficiently. This work proposes a modified attention mechanism adapted to the underlying physics, which allows to recover the relevant non-local effects. Namely, we introduce spherical harmonic coordinates (SPHCs) to reflect higher-order geometric information for each atom in a molecule, enabling a non-local formulation of attention in the SPHC space. Our proposed model So3krates - a self-attention based message passing neural network - uncouples geometric information from atomic features, making them independently amenable to attention mechanisms. Thereby we construct spherical filters, which extend the concept of continuous filters in Euclidean space to SPHC space and serve as foundation for a spherical self-attention mechanism. We show that in contrast to other published methods, So3krates is able to describe non-local quantum mechanical effects over arbitrary length scales. Further, we find evidence that the inclusion of higher-order geometric correlations increases data efficiency and improves generalization. So3krates matches or exceeds state-of-the-art performance on popular benchmarks, notably, requiring a significantly lower number of parameters (0.25 - 0.4x) while at the same time giving a substantial speedup (6 - 14x for training and 2 - 11x for inference) compared to other models.
    Contrastive Trajectory Similarity Learning with Dual-Feature Attention. (arXiv:2210.05155v2 [cs.DB] UPDATED)
    Trajectory similarity measures act as query predicates in trajectory databases, making them the key player in determining the query results. They also have a heavy impact on the query efficiency. An ideal measure should have the capability to accurately evaluate the similarity between any two trajectories in a very short amount of time. Towards this aim, we propose a contrastive learning-based trajectory modeling method named TrajCL. We present four trajectory augmentation methods and a novel dual-feature self-attention-based trajectory backbone encoder. The resultant model can jointly learn both the spatial and the structural patterns of trajectories. Our model does not involve any recurrent structures and thus has a high efficiency. Besides, our pre-trained backbone encoder can be fine-tuned towards other computationally expensive measures with minimal supervision data. Experimental results show that TrajCL is consistently and significantly more accurate than the state-of-the-art trajectory similarity measures. After fine-tuning, i.e., to serve as an estimator for heuristic measures, TrajCL can even outperform the state-of-the-art supervised method by up to 56% in the accuracy for processing trajectory similarity queries.
    Sharper Analysis for Minibatch Stochastic Proximal Point Methods: Stability, Smoothness, and Deviation. (arXiv:2301.03125v1 [stat.ML])
    The stochastic proximal point (SPP) methods have gained recent attention for stochastic optimization, with strong convergence guarantees and superior robustness to the classic stochastic gradient descent (SGD) methods showcased at little to no cost of computational overhead added. In this article, we study a minibatch variant of SPP, namely M-SPP, for solving convex composite risk minimization problems. The core contribution is a set of novel excess risk bounds of M-SPP derived through the lens of algorithmic stability theory. Particularly under smoothness and quadratic growth conditions, we show that M-SPP with minibatch-size $n$ and iteration count $T$ enjoys an in-expectation fast rate of convergence consisting of an $\mathcal{O}\left(\frac{1}{T^2}\right)$ bias decaying term and an $\mathcal{O}\left(\frac{1}{nT}\right)$ variance decaying term. In the small-$n$-large-$T$ setting, this result substantially improves the best known results of SPP-type approaches by revealing the impact of noise level of model on convergence rate. In the complementary small-$T$-large-$n$ regime, we provide a two-phase extension of M-SPP to achieve comparable convergence rates. Moreover, we derive a near-tight high probability (over the randomness of data) bound on the parameter estimation error of a sampling-without-replacement variant of M-SPP. Numerical evidences are provided to support our theoretical predictions when substantialized to Lasso and logistic regression models.
    AnycostFL: Efficient On-Demand Federated Learning over Heterogeneous Edge Devices. (arXiv:2301.03062v1 [cs.LG])
    In this work, we investigate the challenging problem of on-demand federated learning (FL) over heterogeneous edge devices with diverse resource constraints. We propose a cost-adjustable FL framework, named AnycostFL, that enables diverse edge devices to efficiently perform local updates under a wide range of efficiency constraints. To this end, we design the model shrinking to support local model training with elastic computation cost, and the gradient compression to allow parameter transmission with dynamic communication overhead. An enhanced parameter aggregation is conducted in an element-wise manner to improve the model performance. Focusing on AnycostFL, we further propose an optimization design to minimize the global training loss with personalized latency and energy constraints. By revealing the theoretical insights of the convergence analysis, personalized training strategies are deduced for different devices to match their locally available resources. Experiment results indicate that, when compared to the state-of-the-art efficient FL algorithms, our learning framework can reduce up to 1.9 times of the training latency and energy consumption for realizing a reasonable global testing accuracy. Moreover, the results also demonstrate that, our approach significantly improves the converged global accuracy.
    Facial Misrecognition Systems: Simple Weight Manipulations Force DNNs to Err Only on Specific Persons. (arXiv:2301.03118v1 [cs.CR])
    In this paper we describe how to plant novel types of backdoors in any facial recognition model based on the popular architecture of deep Siamese neural networks, by mathematically changing a small fraction of its weights (i.e., without using any additional training or optimization). These backdoors force the system to err only on specific persons which are preselected by the attacker. For example, we show how such a backdoored system can take any two images of a particular person and decide that they represent different persons (an anonymity attack), or take any two images of a particular pair of persons and decide that they represent the same person (a confusion attack), with almost no effect on the correctness of its decisions for other persons. Uniquely, we show that multiple backdoors can be independently installed by multiple attackers who may not be aware of each other's existence with almost no interference. We have experimentally verified the attacks on a FaceNet-based facial recognition system, which achieves SOTA accuracy on the standard LFW dataset of $99.35\%$. When we tried to individually anonymize ten celebrities, the network failed to recognize two of their images as being the same person in $96.97\%$ to $98.29\%$ of the time. When we tried to confuse between the extremely different looking Morgan Freeman and Scarlett Johansson, for example, their images were declared to be the same person in $91.51 \%$ of the time. For each type of backdoor, we sequentially installed multiple backdoors with minimal effect on the performance of each one (for example, anonymizing all ten celebrities on the same model reduced the success rate for each celebrity by no more than $0.91\%$). In all of our experiments, the benign accuracy of the network on other persons was degraded by no more than $0.48\%$ (and in most cases, it remained above $99.30\%$).
    Generalized adaptive smoothing based neural network architecture for traffic state estimation. (arXiv:2301.03439v1 [eess.SY])
    The adaptive smoothing method (ASM) is a standard data-driven technique used in traffic state estimation. The ASM has free parameters which, in practice, are chosen to be some generally acceptable values based on intuition. However, we note that the heuristically chosen values often result in un-physical predictions by the ASM. In this work, we propose a neural network based on the ASM which tunes those parameters automatically by learning from sparse data from road sensors. We refer to it as the adaptive smoothing neural network (ASNN). We also propose a modified ASNN (MASNN), which makes it a strong learner by using ensemble averaging. The ASNN and MASNN are trained and tested two real-world datasets. Our experiments reveal that the ASNN and the MASNN outperform the conventional ASM.
    L-SeqSleepNet: Whole-cycle Long Sequence Modelling for Automatic Sleep Staging. (arXiv:2301.03441v1 [eess.SP])
    Human sleep is cyclical with a period of approximately 90 minutes, implying long temporal dependency in the sleep data. Yet, exploring this long-term dependency when developing sleep staging models has remained untouched. In this work, we show that while encoding the logic of a whole sleep cycle is crucial to improve sleep staging performance, the sequential modelling approach in existing state-of-the-art deep learning models are inefficient for that purpose. We then introduce a method for efficient long sequence modelling and propose a new deep learning model, L-SeqSleepNet, incorporating this method to take into account whole-cycle sleep information for sleep staging. Evaluating L-SeqSleepNet on a set of four distinct databases of various sizes, we demonstrate state-of-the-art performance obtained by the model over three different EEG setups, including scalp EEG in conventional Polysomnography (PSG), in-ear EEG, and around-the-ear EEG (cEEGrid), even with a single-EEG channel input. Our analyses also show that L-SeqSleepNet is able to remedy the effect of N2 sleep (the major class in terms of classification) to bring down errors in other sleep stages and that the network largely reduces exceptionally high errors seen in many subjects. Finally, the computation time only grows at a sub-linear rate when the sequence length increases.
    Fair Clustering Under a Bounded Cost. (arXiv:2106.07239v2 [cs.LG] UPDATED)
    Clustering is a fundamental unsupervised learning problem where a dataset is partitioned into clusters that consist of nearby points in a metric space. A recent variant, fair clustering, associates a color with each point representing its group membership and requires that each color has (approximately) equal representation in each cluster to satisfy group fairness. In this model, the cost of the clustering objective increases due to enforcing fairness in the algorithm. The relative increase in the cost, the ''price of fairness,'' can indeed be unbounded. Therefore, in this paper we propose to treat an upper bound on the clustering objective as a constraint on the clustering problem, and to maximize equality of representation subject to it. We consider two fairness objectives: the group utilitarian objective and the group egalitarian objective, as well as the group leximin objective which generalizes the group egalitarian objective. We derive fundamental lower bounds on the approximation of the utilitarian and egalitarian objectives and introduce algorithms with provable guarantees for them. For the leximin objective we introduce an effective heuristic algorithm. We further derive impossibility results for other natural fairness objectives. We conclude with experimental results on real-world datasets that demonstrate the validity of our algorithms.
    Simple Binary Hypothesis Testing under Local Differential Privacy and Communication Constraints. (arXiv:2301.03566v1 [math.ST])
    We study simple binary hypothesis testing under both local differential privacy (LDP) and communication constraints. We qualify our results as either minimax optimal or instance optimal: the former hold for the set of distribution pairs with prescribed Hellinger divergence and total variation distance, whereas the latter hold for specific distribution pairs. For the sample complexity of simple hypothesis testing under pure LDP constraints, we establish instance-optimal bounds for distributions with binary support; minimax-optimal bounds for general distributions; and (approximately) instance-optimal, computationally efficient algorithms for general distributions. When both privacy and communication constraints are present, we develop instance-optimal, computationally efficient algorithms that achieve the minimum possible sample complexity (up to universal constants). Our results on instance-optimal algorithms hinge on identifying the extreme points of the joint range set $\mathcal A$ of two distributions $p$ and $q$, defined as $\mathcal A := \{(\mathbf T p, \mathbf T q) | \mathbf T \in \mathcal C\}$, where $\mathcal C$ is the set of channels characterizing the constraints.
    Improved Training of Physics-Informed Neural Networks with Model Ensembles. (arXiv:2204.05108v2 [cs.LG] UPDATED)
    Learning the solution of partial differential equations (PDEs) with a neural network (known in the literature as a physics-informed neural network, PINN) is an attractive alternative to traditional solvers due to its elegancy, greater flexibility and the ease of incorporating observed data. However, training PINNs is notoriously difficult in practice. One problem is the existence of multiple simple (but wrong) solutions which are attractive for PINNs when the solution interval is too large. In this paper, we propose to expand the solution interval gradually to make the PINN converge to the correct solution. To find a good schedule for the solution interval expansion, we train an ensemble of PINNs. The idea is that all ensemble members converge to the same solution in the vicinity of observed data (e.g., initial conditions) while they may be pulled towards different wrong solutions farther away from the observations. Therefore, we use the ensemble agreement as the criterion for including new points for computing the loss derived from PDEs. We show experimentally that the proposed method can improve the accuracy of the found solution.
    Discovering and Explaining the Representation Bottleneck of Graph Neural Networks from Multi-order Interactions. (arXiv:2205.07266v4 [cs.LG] UPDATED)
    Graph neural networks (GNNs) mainly rely on the message-passing paradigm to propagate node features and build interactions, and different graph learning tasks require different ranges of node interactions. In this work, we explore the capacity of GNNs to capture interactions between nodes under contexts with different complexities. We discover that GNNs are usually unable to capture the most informative kinds of interaction styles for diverse graph learning tasks, and thus name this phenomenon as GNNs' representation bottleneck. As a response, we demonstrate that the inductive bias introduced by existing graph construction mechanisms can prevent GNNs from learning interactions of the most appropriate complexity, i.e., resulting in the representation bottleneck. To address that limitation, we propose a novel graph rewiring approach based on interaction patterns learned by GNNs to adjust the receptive fields of each node dynamically. Extensive experiments on both real-world and synthetic datasets prove the effectiveness of our algorithm to alleviate the representation bottleneck and its superiority to enhance the performance of GNNs over state-of-the-art graph rewiring baselines.
    Can Foundation Models Help Us Achieve Perfect Secrecy?. (arXiv:2205.13722v2 [cs.LG] UPDATED)
    A key promise of machine learning is the ability to assist users with personal tasks. Because the personal context required to make accurate predictions is often sensitive, we require systems that protect privacy. A gold standard privacy-preserving system will satisfy perfect secrecy, meaning that interactions with the system provably reveal no private information. However, privacy and quality appear to be in tension in existing systems for personal tasks. Neural models typically require copious amounts of training to perform well, while individual users typically hold a limited scale of data, so federated learning (FL) systems propose to learn from the aggregate data of multiple users. FL does not provide perfect secrecy, but rather practitioners apply statistical notions of privacy -- i.e., the probability of learning private information about a user should be reasonably low. The strength of the privacy guarantee is governed by privacy parameters. Numerous privacy attacks have been demonstrated on FL systems and it can be challenging to reason about the appropriate privacy parameters for a privacy-sensitive use case. Therefore our work proposes a simple baseline for FL, which both provides the stronger perfect secrecy guarantee and does not require setting any privacy parameters. We initiate the study of when and where an emerging tool in ML -- the in-context learning abilities of recent pretrained models -- can be an effective baseline alongside FL. We find in-context learning is competitive with strong FL baselines on 6 of 7 popular benchmarks from the privacy literature and a real-world case study, which is disjoint from the pretraining data. We release our code here: https://github.com/simran-arora/focus
    Kantorovich Strikes Back! Wasserstein GANs are not Optimal Transport?. (arXiv:2206.07767v2 [cs.LG] UPDATED)
    Wasserstein Generative Adversarial Networks (WGANs) are the popular generative models built on the theory of Optimal Transport (OT) and the Kantorovich duality. Despite the success of WGANs, it is still unclear how well the underlying OT dual solvers approximate the OT cost (Wasserstein-1 distance, $\mathbb{W}_{1}$) and the OT gradient needed to update the generator. In this paper, we address these questions. We construct 1-Lipschitz functions and use them to build ray monotone transport plans. This strategy yields pairs of continuous benchmark distributions with the analytically known OT plan, OT cost and OT gradient in high-dimensional spaces such as spaces of images. We thoroughly evaluate popular WGAN dual form solvers (gradient penalty, spectral normalization, entropic regularization, etc.) using these benchmark pairs. Even though these solvers perform well in WGANs, none of them faithfully compute $\mathbb{W}_{1}$ in high dimensions. Nevertheless, many provide a meaningful approximation of the OT gradient. These observations suggest that these solvers should not be treated as good estimators of $\mathbb{W}_{1}$, but to some extent they indeed can be used in variational problems requiring the minimization of $\mathbb{W}_{1}$.
    Making Decisions under Outcome Performativity. (arXiv:2210.01745v2 [cs.LG] UPDATED)
    Decision-makers often act in response to data-driven predictions, with the goal of achieving favorable outcomes. In such settings, predictions don't passively forecast the future; instead, predictions actively shape the distribution of outcomes they are meant to predict. This performative prediction setting raises new challenges for learning "optimal" decision rules. In particular, existing solution concepts do not address the apparent tension between the goals of forecasting outcomes accurately and steering individuals to achieve desirable outcomes. To contend with this concern, we introduce a new optimality concept -- performative omniprediction -- adapted from the supervised (non-performative) learning setting. A performative omnipredictor is a single predictor that simultaneously encodes the optimal decision rule with respect to many possibly-competing objectives. Our main result demonstrates that efficient performative omnipredictors exist, under a natural restriction of performative prediction, which we call outcome performativity. On a technical level, our results follow by carefully generalizing the notion of outcome indistinguishability to the outcome performative setting. From an appropriate notion of Performative OI, we recover many consequences known to hold in the supervised setting, such as omniprediction and universal adaptability.
    Annealed Score-Based Diffusion Model for MR Motion Artifact Reduction. (arXiv:2301.03027v1 [eess.IV])
    Motion artifact reduction is one of the important research topics in MR imaging, as the motion artifact degrades image quality and makes diagnosis difficult. Recently, many deep learning approaches have been studied for motion artifact reduction. Unfortunately, most existing models are trained in a supervised manner, requiring paired motion-corrupted and motion-free images, or are based on a strict motion-corruption model, which limits their use for real-world situations. To address this issue, here we present an annealed score-based diffusion model for MRI motion artifact reduction. Specifically, we train a score-based model using only motion-free images, and then motion artifacts are removed by applying forward and reverse diffusion processes repeatedly to gradually impose a low-frequency data consistency. Experimental results verify that the proposed method successfully reduces both simulated and in vivo motion artifacts, outperforming the state-of-the-art deep learning methods.
    Machine-Learning Prediction of the Computed Band Gaps of Double Perovskite Materials. (arXiv:2301.03372v1 [cond-mat.mtrl-sci])
    Prediction of the electronic structure of functional materials is essential for the engineering of new devices. Conventional electronic structure prediction methods based on density functional theory (DFT) suffer from not only high computational cost, but also limited accuracy arising from the approximations of the exchange-correlation functional. Surrogate methods based on machine learning have garnered much attention as a viable alternative to bypass these limitations, especially in the prediction of solid-state band gaps, which motivated this research study. Herein, we construct a random forest regression model for band gaps of double perovskite materials, using a dataset of 1306 band gaps computed with the GLLBSC (Gritsenko, van Leeuwen, van Lenthe, and Baerends solid correlation) functional. Among the 20 physical features employed, we find that the bulk modulus, superconductivity temperature, and cation electronegativity exhibit the highest importance scores, consistent with the physics of the underlying electronic structure. Using the top 10 features, a model accuracy of 85.6% with a root mean square error of 0.64 eV is obtained, comparable to previous studies. Our results are significant in the sense that they attest to the potential of machine learning regressions for the rapid screening of promising candidate functional materials.
    Optimization-based Causal Estimation from Heterogenous Environments. (arXiv:2109.11990v2 [stat.ME] UPDATED)
    This paper presents a new optimization approach to causal estimation. Given data that contains covariates and an outcome, which covariates are causes of the outcome, and what is the strength of the causality? In classical machine learning (ML), the goal of optimization is to maximize predictive accuracy. However, some covariates might exhibit a non-causal association to the outcome. Such spurious associations provide predictive power for classical ML, but they prevent us from causally interpreting the result. This paper proposes CoCo, an optimization algorithm that bridges the gap between pure prediction and causal inference. CoCo leverages the recently-proposed idea of environments, datasets of covariates/response where the causal relationships remain invariant but where the distribution of the covariates changes from environment to environment. Given datasets from multiple environments -- and ones that exhibit sufficient heterogeneity -- CoCo maximizes an objective for which the only solution is the causal solution. We describe the theoretical foundations of this approach and demonstrate its effectiveness on simulated and real datasets. Compared to classical ML and existing methods, CoCo provides more accurate estimates of the causal model.
    Stochastic Halpern Iteration with Variance Reduction for Stochastic Monotone Inclusions. (arXiv:2203.09436v4 [math.OC] UPDATED)
    We study stochastic monotone inclusion problems, which widely appear in machine learning applications, including robust regression and adversarial learning. We propose novel variants of stochastic Halpern iteration with recursive variance reduction. In the cocoercive -- and more generally Lipschitz-monotone -- setup, our algorithm attains $\epsilon$ norm of the operator with $\mathcal{O}(\frac{1}{\epsilon^3})$ stochastic operator evaluations, which significantly improves over state of the art $\mathcal{O}(\frac{1}{\epsilon^4})$ stochastic operator evaluations required for existing monotone inclusion solvers applied to the same problem classes. We further show how to couple one of the proposed variants of stochastic Halpern iteration with a scheduled restart scheme to solve stochastic monotone inclusion problems with ${\mathcal{O}}(\frac{\log(1/\epsilon)}{\epsilon^2})$ stochastic operator evaluations under additional sharpness or strong monotonicity assumptions.
    Differentiable Safe Controller Design through Control Barrier Functions. (arXiv:2209.10034v2 [eess.SY] UPDATED)
    Learning-based controllers, such as neural network (NN) controllers, can show high empirical performance but lack formal safety guarantees. To address this issue, control barrier functions (CBFs) have been applied as a safety filter to monitor and modify the outputs of learning-based controllers in order to guarantee the safety of the closed-loop system. However, such modification can be myopic with unpredictable long-term effects. In this work, we propose a safe-by-construction NN controller which employs differentiable CBF-based safety layers, and investigate the performance of safe-by-construction NN controllers in learning-based control. Specifically, two formulations of controllers are compared: one is projection-based and the other relies on our proposed set-theoretic parameterization. Both methods demonstrate improved closed-loop performance over using CBF as a separate safety filter in numerical experiments.
    Check Your Other Door! Creating Backdoor Attacks in the Frequency Domain. (arXiv:2109.05507v3 [cs.CR] UPDATED)
    Deep Neural Networks (DNNs) are ubiquitous and span a variety of applications ranging from image classification to real-time object detection. As DNN models become more sophisticated, the computational cost of training these models becomes a burden. For this reason, outsourcing the training process has been the go-to option for many DNN users. Unfortunately, this comes at the cost of vulnerability to backdoor attacks. These attacks aim to establish hidden backdoors in the DNN so that it performs well on clean samples, but outputs a particular target label when a trigger is applied to the input. Existing backdoor attacks either generate triggers in the spatial domain or naively poison frequencies in the Fourier domain. In this work, we propose a pipeline based on Fourier heatmaps to generate a spatially dynamic and invisible backdoor attack in the frequency domain. The proposed attack is extensively evaluated on various datasets and network architectures. Unlike most existing backdoor attacks, the proposed attack can achieve high attack success rates with low poisoning rates and little to no drop in performance while remaining imperceptible to the human eye. Moreover, we show that the models poisoned by our attack are resistant to various state-of-the-art (SOTA) defenses, so we contribute two possible defenses that can evade the attack.
    Community detection in multiplex networks based on orthogonal nonnegative matrix tri-factorization. (arXiv:2205.00626v2 [cs.SI] UPDATED)
    Networks are commonly used to model complex systems. The different entities in the system are represented by nodes of the network and their interactions by edges. In most real life systems, the different entities may interact in different ways necessitating the use of multiplex networks where multiple links are used to model the interactions. One of the major tools for inferring network topology is community detection. Although there are numerous works on community detection in single-layer networks, existing community detection methods for multiplex networks mostly learn a common community structure across layers and do not take the heterogeneity across layers into account. In this paper, we introduce a new multiplex community detection method that identifies communities that are common across layers as well as those that are unique to each layer. The proposed method, Multiplex Orthogonal Nonnegative Matrix Tri-Factorization, represents the adjacency matrix of each layer as the sum of two low-rank matrix factorizations corresponding to the common and private communities, respectively. Unlike most of the existing methods, which require the number of communities to be pre-determined, the proposed method also introduces a two stage method to determine the number of common and private communities. The proposed algorithm is evaluated on synthetic and real multiplex networks, as well as for multiview clustering applications, and compared to state-of-the-art techniques.
    Learning Program Representations with a Tree-Structured Transformer. (arXiv:2208.08643v2 [cs.SE] UPDATED)
    Learning vector representations for programs is a critical step in applying deep learning techniques for program understanding tasks. Various neural network models are proposed to learn from tree-structured program representations, e.g., abstract syntax tree (AST) and concrete syntax tree (CST). However, most neural architectures either fail to capture long-range dependencies which are ubiquitous in programs, or cannot learn effective representations for syntax tree nodes, making them incapable of performing the node-level prediction tasks, e.g., bug localization. In this paper, we propose Tree-Transformer, a novel recursive tree-structured neural network to learn the vector representations for source codes. We propose a multi-head attention mechanism to model the dependency between siblings and parent-children node pairs. Moreover, we propose a bi-directional propagation strategy to allow node information passing in two directions, bottom-up and top-down along trees. In this way, Tree-Transformer can learn the information of the node features as well as the global contextual information. The extensive experimental results show that our Tree-Transformer significantly outperforms the existing tree-based and graph-based program representation learning approaches in both the tree-level and node-level prediction tasks.
    VQNet 2.0: A New Generation Machine Learning Framework that Unifies Classical and Quantum. (arXiv:2301.03251v1 [quant-ph])
    With the rapid development of classical and quantum machine learning, a large number of machine learning frameworks have been proposed. However, existing machine learning frameworks usually only focus on classical or quantum, rather than both. Therefore, based on VQNet 1.0, we further propose VQNet 2.0, a new generation of unified classical and quantum machine learning framework that supports hybrid optimization. The core library of the framework is implemented in C++, and the user level is implemented in Python, and it supports deployment on quantum and classical hardware. In this article, we analyze the development trend of the new generation machine learning framework and introduce the design principles of VQNet 2.0 in detail: unity, practicality, efficiency, and compatibility, as well as full particulars of implementation. We illustrate the functions of VQNet 2.0 through several basic applications, including classical convolutional neural networks, quantum autoencoders, hybrid classical-quantum networks, etc. After that, through extensive experiments, we demonstrate that the operation speed of VQNet 2.0 is higher than the comparison method. Finally, through extensive experiments, we demonstrate that VQNet 2.0 can deploy on different hardware platforms, the overall calculation speed is faster than the comparison method. It also can be mixed and optimized with quantum circuits composed of multiple quantum computing libraries.
    Deep Insights of Deepfake Technology : A Review. (arXiv:2105.00192v2 [cs.LG] UPDATED)
    Under the aegis of computer vision and deep learning technology, a new emerging techniques has introduced that anyone can make highly realistic but fake videos, images even can manipulates the voices. This technology is widely known as Deepfake Technology. Although it seems interesting techniques to make fake videos or image of something or some individuals but it could spread as misinformation via internet. Deepfake contents could be dangerous for individuals as well as for our communities, organizations, countries religions etc. As Deepfake content creation involve a high level expertise with combination of several algorithms of deep learning, it seems almost real and genuine and difficult to differentiate. In this paper, a wide range of articles have been examined to understand Deepfake technology more extensively. We have examined several articles to find some insights such as what is Deepfake, who are responsible for this, is there any benefits of Deepfake and what are the challenges of this technology. We have also examined several creation and detection techniques. Our study revealed that although Deepfake is a threat to our societies, proper measures and strict regulations could prevent this.
    Locally-symplectic neural networks for learning volume-preserving dynamics. (arXiv:2109.09151v2 [math-ph] UPDATED)
    We propose locally-symplectic neural networks LocSympNets for learning the flow of phase volume-preserving dynamics. The construction of LocSympNets stems from the theorem of the local Hamiltonian description of the divergence-free vector field and the splitting methods based on symplectic integrators. Symplectic gradient modules of the recently proposed symplecticity-preserving neural networks SympNets are used to construct invertible locally-symplectic modules. To further preserve properties of the flow of a dynamical system LocSympNets are extended to symmetric locally-symplectic neural networks SymLocSympNets, such that the inverse of SymLocSympNets is equal to the feed-forward propagation of SymLocSympNets with the negative time step, which is a general property of the flow of a dynamical system. LocSympNets and SymLocSympNets are studied numerically considering learning linear and nonlinear volume-preserving dynamics. We demonstrate learning of linear traveling wave solutions to the semi-discretized advection equation, periodic trajectories of the Euler equations of the motion of a free rigid body, and quasi-periodic solutions of the charged particle motion in an electromagnetic field. LocSympNets and SymLocSympNets can learn linear and nonlinear dynamics to a high degree of accuracy even when random noise is added to the training data. When learning a single trajectory of the rigid body dynamics locally-symplectic neural networks can learn both quadratic invariants of the system with absolute relative errors below 1%. In addition, SymLocSympNets produce qualitatively good long-time predictions, when the learning of the whole system from randomly sampled data is considered. LocSympNets and SymLocSympNets can produce accurate short-time predictions of quasi-periodic solutions, which is illustrated in the example of the charged particle motion in an electromagnetic field.
    Neural parameter calibration for large-scale multi-agent models. (arXiv:2209.13565v2 [math.OC] UPDATED)
    Computational models have become a powerful tool in the quantitative sciences to understand the behaviour of complex systems that evolve in time. However, they often contain a potentially large number of free parameters whose values cannot be obtained from theory but need to be inferred from data. This is especially the case for models in the social sciences, economics, or computational epidemiology. Yet many current parameter estimation methods are mathematically involved and computationally slow to run. In this paper we present a computationally simple and fast method to retrieve accurate probability densities for model parameters using neural differential equations. We present a pipeline comprising multi-agent models acting as forward solvers for systems of ordinary or stochastic differential equations, and a neural network to then extract parameters from the data generated by the model. The two combined create a powerful tool that can quickly estimate densities on model parameters, even for very large systems. We demonstrate the method on synthetic time series data of the SIR model of the spread of infection, and perform an in-depth analysis of the Harris-Wilson model of economic activity on a network, representing a non-convex problem. For the latter, we apply our method both to synthetic data and to data of economic activity across Greater London. We find that our method calibrates the model orders of magnitude more accurately than a previous study of the same dataset using classical techniques, while running between 195 and 390 times faster.
    Retrieving Users' Opinions on Social Media with Multimodal Aspect-Based Sentiment Analysis. (arXiv:2210.15377v2 [cs.IR] UPDATED)
    People post their opinions and experiences on social media, yielding rich databases of end-users' sentiments. This paper shows to what extent machine learning can analyze and structure these databases. An automated data analysis pipeline is deployed to provide insights into user-generated content for researchers in other domains. First, the domain expert can select an image and a term of interest. Then, the pipeline uses image retrieval to find all images showing similar content and applies aspect-based sentiment analysis to outline users' opinions about the selected term. As part of an interdisciplinary project between architecture and computer science researchers, an empirical study of Hamburg's Elbphilharmonie was conveyed. Therefore, we selected 300 thousand posts with the hashtag \enquote{\texttt{hamburg}} from the platform Flickr. Image retrieval methods generated a subset of slightly more than 1.5 thousand images displaying the Elbphilharmonie. We found that these posts mainly convey a neutral or positive sentiment towards it. With this pipeline, we suggest a new semantic computing method that offers novel insights into end-users opinions, e.g., for architecture domain experts.
    FedDebug: Systematic Debugging for Federated Learning Applications. (arXiv:2301.03553v1 [cs.SE])
    In Federated Learning (FL), clients train a model locally and share it with a central aggregator to build a global model. Impermissibility to access client's data and collaborative training makes FL appealing for applications with data-privacy concerns such as medical imaging. However, these FL characteristics pose unprecedented challenges for debugging. When a global model's performance deteriorates, finding the round and the clients responsible is a major pain point. Developers resort to trial-and-error debugging with subsets of clients, hoping to increase the accuracy or let future FL rounds retune the model, which are time-consuming and costly. We design a systematic fault localization framework, FedDebug, that advances the FL debugging on two novel fronts. First, FedDebug enables interactive debugging of realtime collaborative training in FL by leveraging record and replay techniques to construct a simulation that mirrors live FL. FedDebug's {\em breakpoint} can help inspect an FL state (round, client, and global model) and seamlessly move between rounds and clients' models, enabling a fine-grained step-by-step inspection. Second, FedDebug automatically identifies the client responsible for lowering global model's performance without any testing data and labels--both are essential for existing debugging techniques. FedDebug's strengths come from adapting differential testing in conjunction with neurons activations to determine the precise client deviating from normal behavior. FedDebug achieves 100\% to find a single client and 90.3\% accuracy to find multiple faulty clients. FedDebug's interactive debugging incurs 1.2\% overhead during training, while it localizes a faulty client in only 2.1\% of a round's training time. With FedDebug, we bring effective debugging practices to federated learning, improving the quality and productivity of FL application developers.
    Differentially private inference via noisy optimization. (arXiv:2103.11003v3 [math.ST] UPDATED)
    We propose a general optimization-based framework for computing differentially private M-estimators and a new method for constructing differentially private confidence regions. Firstly, we show that robust statistics can be used in conjunction with noisy gradient descent or noisy Newton methods in order to obtain optimal private estimators with global linear or quadratic convergence, respectively. We establish local and global convergence guarantees, under both local strong convexity and self-concordance, showing that our private estimators converge with high probability to a nearly optimal neighborhood of the non-private M-estimators. Secondly, we tackle the problem of parametric inference by constructing differentially private estimators of the asymptotic variance of our private M-estimators. This naturally leads to approximate pivotal statistics for constructing confidence regions and conducting hypothesis testing. We demonstrate the effectiveness of a bias correction that leads to enhanced small-sample empirical performance in simulations. We illustrate the benefits of our methods in several numerical examples.
    MixCycle: Unsupervised Speech Separation via Cyclic Mixture Permutation Invariant Training. (arXiv:2202.03875v2 [eess.AS] UPDATED)
    We introduce two unsupervised source separation methods, which involve self-supervised training from single-channel two-source speech mixtures. Our first method, mixture permutation invariant training (MixPIT), enables learning a neural network model which separates the underlying sources via a challenging proxy task without supervision from the reference sources. Our second method, cyclic mixture permutation invariant training (MixCycle), uses MixPIT as a building block in a cyclic fashion for continuous learning. MixCycle gradually converts the problem from separating mixtures of mixtures into separating single mixtures. We compare our methods to common supervised and unsupervised baselines: permutation invariant training with dynamic mixing (PIT-DM) and mixture invariant training (MixIT). We show that MixCycle outperforms MixIT and reaches a performance level very close to the supervised baseline (PIT-DM) while circumventing the over-separation issue of MixIT. Also, we propose a self-evaluation technique inspired by MixCycle that estimates model performance without utilizing any reference sources. We show that it yields results consistent with an evaluation on reference sources (LibriMix) and also with an informal listening test conducted on a real-life mixtures dataset (REAL-M).
    Removing Non-Stationary Knowledge From Pre-Trained Language Models for Entity-Level Sentiment Classification in Finance. (arXiv:2301.03136v1 [cs.CL])
    Extraction of sentiment signals from news text, stock message boards, and business reports, for stock movement prediction, has been a rising field of interest in finance. Building upon past literature, the most recent works attempt to better capture sentiment from sentences with complex syntactic structures by introducing aspect-level sentiment classification (ASC). Despite the growing interest, however, fine-grained sentiment analysis has not been fully explored in non-English literature due to the shortage of annotated finance-specific data. Accordingly, it is necessary for non-English languages to leverage datasets and pre-trained language models (PLM) of different domains, languages, and tasks to best their performance. To facilitate finance-specific ASC research in the Korean language, we build KorFinASC, a Korean aspect-level sentiment classification dataset for finance consisting of 12,613 human-annotated samples, and explore methods of intermediate transfer learning. Our experiments indicate that past research has been ignorant towards the potentially wrong knowledge of financial entities encoded during the training phase, which has overestimated the predictive power of PLMs. In our work, we use the term "non-stationary knowledge'' to refer to information that was previously correct but is likely to change, and present "TGT-Masking'', a novel masking pattern to restrict PLMs from speculating knowledge of the kind. Finally, through a series of transfer learning with TGT-Masking applied we improve 22.63% of classification accuracy compared to standalone models on KorFinASC.
    A review of clustering models in educational data science towards fairness-aware learning. (arXiv:2301.03421v1 [cs.LG])
    Ensuring fairness is essential for every education system. Machine learning is increasingly supporting the education system and educational data science (EDS) domain, from decision support to educational activities and learning analytics. However, the machine learning-based decisions can be biased because the algorithms may generate the results based on students' protected attributes such as race or gender. Clustering is an important machine learning technique to explore student data in order to support the decision-maker, as well as support educational activities, such as group assignments. Therefore, ensuring high-quality clustering models along with satisfying fairness constraints are important requirements. This chapter comprehensively surveys clustering models and their fairness in EDS. We especially focus on investigating the fair clustering models applied in educational activities. These models are believed to be practical tools for analyzing students' data and ensuring fairness in EDS.
    Federated Learning with Domain Generalization. (arXiv:2111.10487v2 [cs.LG] UPDATED)
    Federated Learning (FL) enables a group of clients to jointly train a machine learning model with the help of a centralized server. Clients do not need to submit their local data to the server during training, and hence the local training data of clients is protected. In FL, distributed clients collect their local data independently, so the dataset of each client may naturally form a distinct source domain. In practice, the model trained over multiple source domains may have poor generalization performance on unseen target domains. To address this issue, we propose FedADG to equip federated learning with domain generalization capability. FedADG employs the federated adversarial learning approach to measure and align the distributions among different source domains via matching each distribution to a reference distribution. The reference distribution is adaptively generated (by accommodating all source domains) to minimize the domain shift distance during alignment. In FedADG, the alignment is fine-grained since each class is aligned independently. In this way, the learned feature representation is supposed to be universal, so it can generalize well on the unseen domains. Intensive experiments on various datasets demonstrate that FedADG has comparable performance with the state-of-the-art.
    Why Batch Normalization Damage Federated Learning on Non-IID Data?. (arXiv:2301.02982v1 [cs.LG])
    As a promising distributed learning paradigm, federated learning (FL) involves training deep neural network (DNN) models at the network edge while protecting the privacy of the edge clients. To train a large-scale DNN model, batch normalization (BN) has been regarded as a simple and effective means to accelerate the training and improve the generalization capability. However, recent findings indicate that BN can significantly impair the performance of FL in the presence of non-i.i.d. data. While several FL algorithms have been proposed to address this issue, their performance still falls significantly when compared to the centralized scheme. Furthermore, none of them have provided a theoretical explanation of how the BN damages the FL convergence. In this paper, we present the first convergence analysis to show that under the non-i.i.d. data, the mismatch between the local and global statistical parameters in BN causes the gradient deviation between the local and global models, which, as a result, slows down and biases the FL convergence. In view of this, we develop a new FL algorithm that is tailored to BN, called FedTAN, which is capable of achieving robust FL performance under a variety of data distributions via iterative layer-wise parameter aggregation. Comprehensive experimental results demonstrate the superiority of the proposed FedTAN over existing baselines for training BN-based DNN models.  ( 2 min )
    Physics-Informed Kernel Embeddings: Integrating Prior System Knowledge with Data-Driven Control. (arXiv:2301.03565v1 [eess.SY])
    Data-driven control algorithms use observations of system dynamics to construct an implicit model for the purpose of control. However, in practice, data-driven techniques often require excessive sample sizes, which may be infeasible in real-world scenarios where only limited observations of the system are available. Furthermore, purely data-driven methods often neglect useful a priori knowledge, such as approximate models of the system dynamics. We present a method to incorporate such prior knowledge into data-driven control algorithms using kernel embeddings, a nonparametric machine learning technique based in the theory of reproducing kernel Hilbert spaces. Our proposed approach incorporates prior knowledge of the system dynamics as a bias term in the kernel learning problem. We formulate the biased learning problem as a least-squares problem with a regularization term that is informed by the dynamics, that has an efficiently computable, closed-form solution. Through numerical experiments, we empirically demonstrate the improved sample efficiency and out-of-sample generalization of our approach over a purely data-driven baseline. We demonstrate an application of our method to control through a target tracking problem with nonholonomic dynamics, and on spring-mass-damper and F-16 aircraft state prediction tasks.
    Equivariant and Steerable Neural Networks: A review with special emphasis on the symmetric group. (arXiv:2301.03019v1 [cs.LG])
    Convolutional neural networks revolutionized computer vision and natrual language processing. Their efficiency, as compared to fully connected neural networks, has its origin in the architecture, where convolutions reflect the translation invariance in space and time in pattern or speech recognition tasks. Recently, Cohen and Welling have put this in the broader perspective of invariance under symmetry groups, which leads to the concept of group equivaiant neural networks and more generally steerable neural networks. In this article, we review the architecture of such networks including equivariant layers and filter banks, activation with capsules and group pooling. We apply this formalism to the symmetric group, for which we work out a number of details on representations and capsules that are not found in the literature.  ( 2 min )
    Systems for Parallel and Distributed Large-Model Deep Learning Training. (arXiv:2301.02691v1 [cs.DC])
    Deep learning (DL) has transformed applications in a variety of domains, including computer vision, natural language processing, and tabular data analysis. The search for improved DL model accuracy has led practitioners to explore increasingly large neural architectures, with some recent Transformer models spanning hundreds of billions of learnable parameters. These designs have introduced new scale-driven systems challenges for the DL space, such as memory bottlenecks, poor runtime efficiency, and high costs of model development. Efforts to address these issues have explored techniques such as parallelization of neural architectures, spilling data across the memory hierarchy, and memory-efficient data representations. This survey will explore the large-model training systems landscape, highlighting key challenges and the various techniques that have been used to address them.  ( 2 min )
    Exploration in Model-based Reinforcement Learning with Randomized Reward. (arXiv:2301.03142v1 [stat.ML])
    Model-based Reinforcement Learning (MBRL) has been widely adapted due to its sample efficiency. However, existing worst-case regret analysis typically requires optimistic planning, which is not realistic in general. In contrast, motivated by the theory, empirical study utilizes ensemble of models, which achieve state-of-the-art performance on various testing environments. Such deviation between theory and empirical study leads us to question whether randomized model ensemble guarantee optimism, and hence the optimal worst-case regret? This paper partially answers such question from the perspective of reward randomization, a scarcely explored direction of exploration with MBRL. We show that under the kernelized linear regulator (KNR) model, reward randomization guarantees a partial optimism, which further yields a near-optimal worst-case regret in terms of the number of interactions. We further extend our theory to generalized function approximation and identified conditions for reward randomization to attain provably efficient exploration. Correspondingly, we propose concrete examples of efficient reward randomization. To the best of our knowledge, our analysis establishes the first worst-case regret analysis on randomized MBRL with function approximation.  ( 2 min )
    Machine Learning to Estimate Gross Loss of Jewelry for Wax Patterns. (arXiv:2301.02872v1 [cs.LG])
    In mass manufacturing of jewellery, the gross loss is estimated before manufacturing to calculate the wax weight of the pattern that would be investment casted to make multiple identical pieces of jewellery. Machine learning is a technology that is a part of AI which helps create a model with decision-making capabilities based on a large set of user-defined data. In this paper, the authors found a way to use Machine Learning in the jewellery industry to estimate this crucial Gross Loss. Choosing a small data set of manufactured rings and via regression analysis, it was found out that there is a potential of reducing the error in estimation from +-2-3 to +-0.5 using ML Algorithms from historic data and attributes collected from the CAD file during the design phase itself. To evaluate the approach's viability, additional study must be undertaken with a larger data set.  ( 2 min )
    DebiasedDTA: A Framework for Improving the Generalizability of Drug-Target Affinity Prediction Models. (arXiv:2107.05556v5 [q-bio.QM] UPDATED)
    Computational models that accurately predict the binding affinity of an input protein-chemical pair can accelerate drug discovery studies. These models are trained on available protein-chemical interaction datasets, which may contain dataset biases that may lead the model to learn dataset-specific patterns, instead of generalizable relationships. As a result, the prediction performance of models drops for previously unseen biomolecules, $\textit{i.e.}$ the prediction models cannot generalize to biomolecules outside of the dataset. The latest approaches that aim to improve model generalizability either have limited applicability or introduce the risk of degrading prediction performance. Here, we present DebiasedDTA, a novel drug-target affinity (DTA) prediction model training framework that addresses dataset biases to improve the generalizability of affinity prediction models. DebiasedDTA reweights the training samples to mitigate the effect of dataset biases and is applicable to most DTA prediction models. The results suggest that models trained in the DebiasedDTA framework can achieve improved generalizability in predicting the interactions of the previously unseen biomolecules, as well as performance improvements on those previously seen. Extensive experiments with different biomolecule representations, model architectures, and datasets demonstrate that DebiasedDTA can upgrade DTA prediction models irrespective of the biomolecule representation, model architecture, and training dataset. Last but not least, we release DebiasedDTA as an open-source python library to enable other researchers to debias their own predictors and/or develop their own debiasing methods. We believe that this python library will corroborate and foster research to develop more generalizable DTA prediction models.
    FlexShard: Flexible Sharding for Industry-Scale Sequence Recommendation Models. (arXiv:2301.02959v1 [cs.LG])
    Sequence-based deep learning recommendation models (DLRMs) are an emerging class of DLRMs showing great improvements over their prior sum-pooling based counterparts at capturing users' long term interests. These improvements come at immense system cost however, with sequence-based DLRMs requiring substantial amounts of data to be dynamically materialized and communicated by each accelerator during a single iteration. To address this rapidly growing bottleneck, we present FlexShard, a new tiered sequence embedding table sharding algorithm which operates at a per-row granularity by exploiting the insight that not every row is equal. Through precise replication of embedding rows based on their underlying probability distribution, along with the introduction of a new sharding strategy adapted to the heterogeneous, skewed performance of real-world cluster network topologies, FlexShard is able to significantly reduce communication demand while using no additional memory compared to the prior state-of-the-art. When evaluated on production-scale sequence DLRMs, FlexShard was able to reduce overall global all-to-all communication traffic by over 85%, resulting in end-to-end training communication latency improvements of almost 6x over the prior state-of-the-art approach.
    Using External Off-Policy Speech-To-Text Mappings in Contextual End-To-End Automated Speech Recognition. (arXiv:2301.02736v1 [eess.AS])
    Despite improvements to the generalization performance of automated speech recognition (ASR) models, specializing ASR models for downstream tasks remains a challenging task, primarily due to reduced data availability (necessitating increased data collection), and rapidly shifting data distributions (requiring more frequent model fine-tuning). In this work, we investigate the potential of leveraging external knowledge, particularly through off-policy key-value stores generated with text-to-speech methods, to allow for flexible post-training adaptation to new data distributions. In our approach, audio embeddings captured from text-to-speech, along with semantic text embeddings, are used to bias ASR via an approximate k-nearest-neighbor (KNN) based attentive fusion step. Our experiments on LibiriSpeech and in-house voice assistant/search datasets show that the proposed approach can reduce domain adaptation time by up to 1K GPU-hours while providing up to 3% WER improvement compared to a fine-tuning baseline, suggesting a promising approach for adapting production ASR systems in challenging zero and few-shot scenarios.  ( 2 min )
    The Optimal Input-Independent Baseline for Binary Classification: The Dutch Draw. (arXiv:2301.03318v1 [cs.LG])
    Before any binary classification model is taken into practice, it is important to validate its performance on a proper test set. Without a frame of reference given by a baseline method, it is impossible to determine if a score is `good' or `bad'. The goal of this paper is to examine all baseline methods that are independent of feature values and determine which model is the `best' and why. By identifying which baseline models are optimal, a crucial selection decision in the evaluation process is simplified. We prove that the recently proposed Dutch Draw baseline is the best input-independent classifier (independent of feature values) for all positional-invariant measures (independent of sequence order) assuming that the samples are randomly shuffled. This means that the Dutch Draw baseline is the optimal baseline under these intuitive requirements and should therefore be used in practice.  ( 2 min )
    PatchUp: A Feature-Space Block-Level Regularization Technique for Convolutional Neural Networks. (arXiv:2006.07794v2 [cs.LG] UPDATED)
    Large capacity deep learning models are often prone to a high generalization gap when trained with a limited amount of labeled training data. A recent class of methods to address this problem uses various ways to construct a new training sample by mixing a pair (or more) of training samples. We propose PatchUp, a hidden state block-level regularization technique for Convolutional Neural Networks (CNNs), that is applied on selected contiguous blocks of feature maps from a random pair of samples. Our approach improves the robustness of CNN models against the manifold intrusion problem that may occur in other state-of-the-art mixing approaches. Moreover, since we are mixing the contiguous block of features in the hidden space, which has more dimensions than the input space, we obtain more diverse samples for training towards different dimensions. Our experiments on CIFAR10/100, SVHN, Tiny-ImageNet, and ImageNet using ResNet architectures including PreActResnet18/34, WRN-28-10, ResNet101/152 models show that PatchUp improves upon, or equals, the performance of current state-of-the-art regularizers for CNNs. We also show that PatchUp can provide a better generalization to deformed samples and is more robust against adversarial attacks.  ( 2 min )
    A Newton-CG based augmented Lagrangian method for finding a second-order stationary point of nonconvex equality constrained optimization with complexity guarantees. (arXiv:2301.03139v1 [math.OC])
    In this paper we consider finding a second-order stationary point (SOSP) of nonconvex equality constrained optimization when a nearly feasible point is known. In particular, we first propose a new Newton-CG method for finding an approximate SOSP of unconstrained optimization and show that it enjoys a substantially better complexity than the Newton-CG method [56]. We then propose a Newton-CG based augmented Lagrangian (AL) method for finding an approximate SOSP of nonconvex equality constrained optimization, in which the proposed Newton-CG method is used as a subproblem solver. We show that under a generalized linear independence constraint qualification (GLICQ), our AL method enjoys a total inner iteration complexity of $\widetilde{\cal O}(\epsilon^{-7/2})$ and an operation complexity of $\widetilde{\cal O}(\epsilon^{-7/2}\min\{n,\epsilon^{-3/4}\})$ for finding an $(\epsilon,\sqrt{\epsilon})$-SOSP of nonconvex equality constrained optimization with high probability, which are significantly better than the ones achieved by the proximal AL method [60]. Besides, we show that it has a total inner iteration complexity of $\widetilde{\cal O}(\epsilon^{-11/2})$ and an operation complexity of $\widetilde{\cal O}(\epsilon^{-11/2}\min\{n,\epsilon^{-5/4}\})$ when the GLICQ does not hold. To the best of our knowledge, all the complexity results obtained in this paper are new for finding an approximate SOSP of nonconvex equality constrained optimization with high probability. Preliminary numerical results also demonstrate the superiority of our proposed methods over the ones in [56,60].  ( 2 min )
    Fast and Correct Gradient-Based Optimisation for Probabilistic Programming via Smoothing. (arXiv:2301.03415v1 [cs.PL])
    We study the foundations of variational inference, which frames posterior inference as an optimisation problem, for probabilistic programming. The dominant approach for optimisation in practice is stochastic gradient descent. In particular, a variant using the so-called reparameterisation gradient estimator exhibits fast convergence in a traditional statistics setting. Unfortunately, discontinuities, which are readily expressible in programming languages, can compromise the correctness of this approach. We consider a simple (higher-order, probabilistic) programming language with conditionals, and we endow our language with both a measurable and a smoothed (approximate) value semantics. We present type systems which establish technical pre-conditions. Thus we can prove stochastic gradient descent with the reparameterisation gradient estimator to be correct when applied to the smoothed problem. Besides, we can solve the original problem up to any error tolerance by choosing an accuracy coefficient suitably. Empirically we demonstrate that our approach has a similar convergence as a key competitor, but is simpler, faster, and attains orders of magnitude reduction in work-normalised variance.
    XDQN: Inherently Interpretable DQN through Mimicking. (arXiv:2301.03043v1 [cs.LG])
    Although deep reinforcement learning (DRL) methods have been successfully applied in challenging tasks, their application in real-world operational settings is challenged by methods' limited ability to provide explanations. Among the paradigms for explainability in DRL is the interpretable box design paradigm, where interpretable models substitute inner constituent models of the DRL method, thus making the DRL method "inherently" interpretable. In this paper we explore this paradigm and we propose XDQN, an explainable variation of DQN, which uses an interpretable policy model trained through mimicking. XDQN is challenged in a complex, real-world operational multi-agent problem, where agents are independent learners solving congestion problems. Specifically, XDQN is evaluated in three MARL scenarios, pertaining to the demand-capacity balancing problem of air traffic management. XDQN achieves performance similar to that of DQN, while its abilities to provide global models' interpretations and interpretations of local decisions are demonstrated.  ( 2 min )
    Differentiable Simulations for Enhanced Sampling of Rare Events. (arXiv:2301.03480v1 [physics.chem-ph])
    We develop a novel approach to enhanced sampling of chemically reactive events using differentiable simulations. We merge the reaction path discovery and biasing potential computation into one end-to-end problem and solve it by path-integral optimization. The techniques developed contribute directly to the understanding and usability of differentiable simulations as we introduce new approaches and prove the stability properties of our method.
    Modeling Label Semantics Improves Activity Recognition. (arXiv:2301.03462v1 [cs.LG])
    Human activity recognition (HAR) aims to classify sensory time series into different activities, with wide applications in activity tracking, healthcare, human computer interaction, etc. Existing HAR works improve recognition performance by designing more complicated feature extraction methods, but they neglect the label semantics by simply treating labels as integer IDs. We find that many activities in the current HAR datasets have shared label names, e.g., "open door" and "open fridge", "walk upstairs" and "walk downstairs". Through some exploratory analysis, we find that such shared structure in activity names also maps to similarity in the input features. To this end, we design a sequence-to-sequence framework to decode the label name semantics rather than classifying labels as integer IDs. Our proposed method decomposes learning activities into learning shared tokens ("open", "walk"), which is easier than learning the joint distribution ("open fridge", "walk upstairs") and helps transfer learning to activities with insufficient data samples. For datasets originally without shared tokens in label names, we also offer an automated method, using OpenAI's ChatGPT, to generate shared actions and objects. Extensive experiments on seven HAR benchmark datasets demonstrate the state-of-the-art performance of our method. We also show better performance in the long-tail activity distribution settings and few-shot settings.
    Unsupervised Multivariate Time-Series Transformers for Seizure Identification on EEG. (arXiv:2301.03470v1 [eess.SP])
    Epilepsy is one of the most common neurological disorders, typically observed via seizure episodes. Epileptic seizures are commonly monitored through electroencephalogram (EEG) recordings due to their routine and low expense collection. The stochastic nature of EEG makes seizure identification via manual inspections performed by highly-trained experts a tedious endeavor, motivating the use of automated identification. The literature on automated identification focuses mostly on supervised learning methods requiring expert labels of EEG segments that contain seizures, which are difficult to obtain. Motivated by these observations, we pose seizure identification as an unsupervised anomaly detection problem. To this end, we employ the first unsupervised transformer-based model for seizure identification on raw EEG. We train an autoencoder involving a transformer encoder via an unsupervised loss function, incorporating a novel masking strategy uniquely designed for multivariate time-series data such as EEG. Training employs EEG recordings that do not contain any seizures, while seizures are identified with respect to reconstruction errors at inference time. We evaluate our method on three publicly available benchmark EEG datasets for distinguishing seizure vs. non-seizure windows. Our method leads to significantly better seizure identification performance than supervised learning counterparts, by up to 16% recall, 9% accuracy, and 9% Area under the Receiver Operating Characteristics Curve (AUC), establishing particular benefits on highly imbalanced data. Through accurate seizure identification, our method could facilitate widely accessible and early detection of epilepsy development, without needing expensive label collection or manual feature extraction.
    MAQA: A Multimodal QA Benchmark for Negation. (arXiv:2301.03238v1 [cs.CL])
    Multimodal learning can benefit from the representation power of pretrained Large Language Models (LLMs). However, state-of-the-art transformer based LLMs often ignore negations in natural language and there is no existing benchmark to quantitatively evaluate whether multimodal transformers inherit this weakness. In this study, we present a new multimodal question answering (QA) benchmark adapted from labeled music videos in AudioSet (Gemmeke et al., 2017) with the goal of systematically evaluating if multimodal transformers can perform complex reasoning to recognize new concepts as negation of previously learned concepts. We show that with standard fine-tuning approach multimodal transformers are still incapable of correctly interpreting negation irrespective of model size. However, our experiments demonstrate that augmenting the original training task distributions with negated QA examples allow the model to reliably reason with negation. To do this, we describe a novel data generation procedure that prompts the 540B-parameter PaLM model to automatically generate negated QA examples as compositions of easily accessible video tags. The generated examples contain more natural linguistic patterns and the gains compared to template-based task augmentation approach are significant.
    Deep Learning for Short-Latency Epileptic Seizure Detection with Probabilistic Classification. (arXiv:2301.03465v1 [eess.SP])
    In this manuscript, we propose a novel deep learning (DL)-based framework intended for obtaining short latency in real-time electroencephalogram-based epileptic seizure detection using multiscale 3D convolutional neural networks. We pioneer converting seizure detection task from traditional binary classification of samples from ictal and interictal periods to probabilistic classification of samples from interictal, ictal, and crossing periods. We introduce a crossing period from seizure-oriented EEG recording and propose a labelling rule using soft-label for samples from the crossing period to build a probabilistic classification task. A novel multiscale short-time Fourier transform feature extraction method and 3D convolution neural network architecture are proposed to accurately capture predictive probabilities of samples. Furthermore, we also propose rectified weighting strategy to enhance predictive probabilities, and accumulative decision-making rule to achieve short detection latency. We implement leave-one-seizure-out cross validation on two prevalent datasets -- CHB-MIT scalp EEG dataset and SWEC-ETHZ intracranial EEG dataset. Eventually, the proposed algorithm achieved 94 out of 99 seizures detected during the crossing period, averaged 14.84% rectified predictive ictal probability (RPIP) errors of crossing samples, 2.3 s detection latency, 0.32/h false detection rate on CHB-MIT dataset, meanwhile 84 out of 89 detected seizures, 16.17% RPIP errors, 4.7 s detection latency, and 0.75/h FDR are achieved on SWEC-ETHZ dataset. The obtained detection latencies are at least 50% faster than state-of-the-art results reported in previous studies.
    Safer Together: Machine Learning Models Trained on Shared Accident Datasets Predict Construction Injuries Better than Company-Specific Models. (arXiv:2301.03567v1 [cs.LG])
    In this study, we capitalized on a collective dataset repository of 57k accidents from 9 companies belonging to 3 domains and tested whether models trained on multiple datasets (generic models) predicted safety outcomes better than the company-specific models. We experimented with full generic models (trained on all data), per-domain generic models (construction, electric T&D, oil & gas), and with ensembles of generic and specific models. Results are very positive, with generic models outperforming the company-specific models in most cases while also generating finer-grained, hence more useful, forecasts. Successful generic models remove the needs for training company-specific models, saving a lot of time and resources, and give small companies, whose accident datasets are too limited to train their own models, access to safety outcome predictions. It may still however be advantageous to train specific models to get an extra boost in performance through ensembling with the generic models. Overall, by learning lessons from a pool of datasets whose accumulated experience far exceeds that of any single company, and making these lessons easily accessible in the form of simple forecasts, generic models tackle the holy grail of safety cross-organizational learning and dissemination in the construction industry.
    Cursive Caption Text Detection in Videos. (arXiv:2301.03164v1 [cs.CV])
    Textual content appearing in videos represents an interesting index for semantic retrieval of videos (from archives), generation of alerts (live streams) as well as high level applications like opinion mining and content summarization. One of the key components of such systems is the detection of textual content in video frames and the same makes the subject of our present study. This paper presents a robust technique for detection of textual content appearing in video frames. More specifically we target text in cursive script taking Urdu text as a case study. Detection of textual regions in video frames is carried out by fine-tuning object detectors based on deep convolutional neural networks for the specific case of text detection. Since it is common to have videos with caption text in multiple-scripts, cursive text is distinguished from Latin text using a script-identification module. Finally, detection and script identification are combined in a single end-to-end trainable system. Experiments on a comprehensive dataset of around 11,000 video frames report an F-measure of 0.91.
    eFIN: Enhanced Fourier Imager Network for generalizable autofocusing and pixel super-resolution in holographic imaging. (arXiv:2301.03162v1 [physics.optics])
    The application of deep learning techniques has greatly enhanced holographic imaging capabilities, leading to improved phase recovery and image reconstruction. Here, we introduce a deep neural network termed enhanced Fourier Imager Network (eFIN) as a highly generalizable framework for hologram reconstruction with pixel super-resolution and image autofocusing. Through holographic microscopy experiments involving lung, prostate and salivary gland tissue sections and Papanicolau (Pap) smears, we demonstrate that eFIN has a superior image reconstruction quality and exhibits external generalization to new types of samples never seen during the training phase. This network achieves a wide autofocusing axial range of 0.35 mm, with the capability to accurately predict the hologram axial distances by physics-informed learning. eFIN enables 3x pixel super-resolution imaging and increases the space-bandwidth product of the reconstructed images by 9-fold with almost no performance loss, which allows for significant time savings in holographic imaging and data processing steps. Our results showcase the advancements of eFIN in pushing the boundaries of holographic imaging for various applications in e.g., quantitative phase imaging and label-free microscopy.  ( 2 min )
    KIDS: kinematics-based (in)activity detection and segmentation in a sleep case study. (arXiv:2301.03469v1 [eess.SP])
    Sleep behaviour and in-bed movements contain rich information on the neurophysiological health of people, and have a direct link to the general well-being and quality of life. Standard clinical practices rely on polysomnography for sleep assessment; however, it is intrusive, performed in unfamiliar environments and requires trained personnel. Progress has been made on less invasive sensor technologies, such as actigraphy, but clinical validation raises concerns over their reliability and precision. Additionally, the field lacks a widely acceptable algorithm, with proposed approaches ranging from raw signal or feature thresholding to data-hungry classification models, many of which are unfamiliar to medical staff. This paper proposes an online Bayesian probabilistic framework for objective (in)activity detection and segmentation based on clinically meaningful joint kinematics, measured by a custom-made wearable sensor. Intuitive three-dimensional visualisations of kinematic timeseries were accomplished through dimension reduction based preprocessing, offering out-of-the-box framework explainability potentially useful for clinical monitoring and diagnosis. The proposed framework attained up to 99.2\% $F_1$-score and 0.96 Pearson's correlation coefficient in, respectively, the posture change detection and inactivity segmentation tasks. The work paves the way for a reliable home-based analysis of movements during sleep which would serve patient-centred longitudinal care plans.
    Network Slicing via Transfer Learning aided Distributed Deep Reinforcement Learning. (arXiv:2301.03262v1 [cs.NI])
    Deep reinforcement learning (DRL) has been increasingly employed to handle the dynamic and complex resource management in network slicing. The deployment of DRL policies in real networks, however, is complicated by heterogeneous cell conditions. In this paper, we propose a novel transfer learning (TL) aided multi-agent deep reinforcement learning (MADRL) approach with inter-agent similarity analysis for inter-cell inter-slice resource partitioning. First, we design a coordinated MADRL method with information sharing to intelligently partition resource to slices and manage inter-cell interference. Second, we propose an integrated TL method to transfer the learned DRL policies among different local agents for accelerating the policy deployment. The method is composed of a new domain and task similarity measurement approach and a new knowledge transfer approach, which resolves the problem of from whom to transfer and how to transfer. We evaluated the proposed solution with extensive simulations in a system-level simulator and show that our approach outperforms the state-of-the-art solutions in terms of performance, convergence speed and sample efficiency. Moreover, by applying TL, we achieve an additional gain over 27% higher than the coordinate MADRL approach without TL.
    On the challenges to learn from Natural Data Streams. (arXiv:2301.03495v1 [cs.CV])
    In real-world contexts, sometimes data are available in form of Natural Data Streams, i.e. data characterized by a streaming nature, unbalanced distribution, data drift over a long time frame and strong correlation of samples in short time ranges. Moreover, a clear separation between the traditional training and deployment phases is usually lacking. This data organization and fruition represents an interesting and challenging scenario for both traditional Machine and Deep Learning algorithms and incremental learning agents, i.e. agents that have the ability to incrementally improve their knowledge through the past experience. In this paper, we investigate the classification performance of a variety of algorithms that belong to various research field, i.e. Continual, Streaming and Online Learning, that receives as training input Natural Data Streams. The experimental validation is carried out on three different datasets, expressly organized to replicate this challenging setting.
    Deep Learning for Mean Field Games with non-separable Hamiltonians. (arXiv:2301.02877v1 [cs.LG])
    This paper introduces a new method based on Deep Galerkin Methods (DGMs) for solving high-dimensional stochastic Mean Field Games (MFGs). We achieve this by using two neural networks to approximate the unknown solutions of the MFG system and forward-backward conditions. Our method is efficient, even with a small number of iterations, and is capable of handling up to 300 dimensions with a single layer, which makes it faster than other approaches. In contrast, methods based on Generative Adversarial Networks (GANs) cannot solve MFGs with non-separable Hamiltonians. We demonstrate the effectiveness of our approach by applying it to a traffic flow problem, which was previously solved using the Newton iteration method only in the deterministic case. We compare the results of our method to analytical solutions and previous approaches, showing its efficiency. We also prove the convergence of our neural network approximation with a single hidden layer using the universal approximation theorem.  ( 2 min )
    Stochastic Langevin Monte Carlo for (weakly) log-concave posterior distributions. (arXiv:2301.03077v1 [stat.ML])
    In this paper, we investigate a continuous time version of the Stochastic Langevin Monte Carlo method, introduced in [WT11], that incorporates a stochastic sampling step inside the traditional over-damped Langevin diffusion. This method is popular in machine learning for sampling posterior distribution. We will pay specific attention in our work to the computational cost in terms of $n$ (the number of observations that produces the posterior distribution), and $d$ (the dimension of the ambient space where the parameter of interest is living). We derive our analysis in the weakly convex framework, which is parameterized with the help of the Kurdyka-\L ojasiewicz (KL) inequality, that permits to handle a vanishing curvature settings, which is far less restrictive when compared to the simple strongly convex case. We establish that the final horizon of simulation to obtain an $\varepsilon$ approximation (in terms of entropy) is of the order $( d \log(n)^2 )^{(1+r)^2} [\log^2(\varepsilon^{-1}) + n^2 d^{2(1+r)} \log^{4(1+r)}(n) ]$ with a Poissonian subsampling of parameter $\left(n ( d \log^2(n))^{1+r}\right)^{-1}$, where the parameter $r$ is involved in the KL inequality and varies between $0$ (strongly convex case) and $1$ (limiting Laplace situation).  ( 2 min )
    LS-DYNA Machine Learning-based Multiscale Method for Nonlinear Modeling of Short Fiber-Reinforced Composites. (arXiv:2301.02738v1 [cs.CE])
    Short-fiber-reinforced composites (SFRC) are high-performance engineering materials for lightweight structural applications in the automotive and electronics industries. Typically, SFRC structures are manufactured by injection molding, which induces heterogeneous microstructures, and the resulting nonlinear anisotropic behaviors are challenging to predict by conventional micromechanical analyses. In this work, we present a machine learning-based multiscale method by integrating injection molding-induced microstructures, material homogenization, and Deep Material Network (DMN) in the finite element simulation software LS-DYNA for structural analysis of SFRC. DMN is a physics-embedded machine learning model that learns the microscale material morphologies hidden in representative volume elements of composites through offline training. By coupling DMN with finite elements, we have developed a highly accurate and efficient data-driven approach, which predicts nonlinear behaviors of composite materials and structures at a computational speed orders-of-magnitude faster than the high-fidelity direct numerical simulation. To model industrial-scale SFRC products, transfer learning is utilized to generate a unified DMN database, which effectively captures the effects of injection molding-induced fiber orientations and volume fractions on the overall composite properties. Numerical examples are presented to demonstrate the promising performance of this LS-DYNA machine learning-based multiscale method for SFRC modeling.  ( 2 min )
    GAN-Based Content Generation of Maps for Strategy Games. (arXiv:2301.02874v1 [cs.LG])
    Maps are a very important component of strategy games, and a time-consuming task if done by hand. Maps generated by traditional PCG techniques such as Perlin noise or tile-based PCG techniques look unnatural and unappealing, thus not providing the best user experience for the players. However it is possible to have a generator that can create realistic and natural images of maps, given that it is trained how to do so. We propose a model for the generation of maps based on Generative Adversarial Networks (GAN). In our implementation we tested out different variants of GAN-based networks on a dataset of heightmaps. We conducted extensive empirical evaluation to determine the advantages and properties of each approach. The results obtained are promising, showing that it is indeed possible to generate realistic looking maps using this type of approach.  ( 2 min )
    Sublinear Time Algorithms for Several Geometric Optimization (With Outliers) Problems In Machine Learning. (arXiv:2301.02870v1 [cs.DS])
    In this paper, we study several important geometric optimization problems arising in machine learning. First, we revisit the Minimum Enclosing Ball (MEB) problem in Euclidean space $\mathbb{R}^d$. The problem has been extensively studied before, but real-world machine learning tasks often need to handle large-scale datasets so that we cannot even afford linear time algorithms. Motivated by the recent studies on {\em beyond worst-case analysis}, we introduce the notion of stability for MEB, which is natural and easy to understand. Roughly speaking, an instance of MEB is stable, if the radius of the resulting ball cannot be significantly reduced by removing a small fraction of the input points. Under the stability assumption, we present two sampling algorithms for computing radius-approximate MEB with sample complexities independent of the number of input points $n$. In particular, the second algorithm has the sample complexity even independent of the dimensionality $d$. We also consider the general case without the stability assumption. We present a hybrid algorithm that can output either a radius-approximate MEB or a covering-approximate MEB. Our algorithm improves the running time and the number of passes for the previous sublinear MEB algorithms. Our method relies on two novel techniques, the Uniform-Adaptive Sampling method and Sandwich Lemma. Furthermore, we observe that these two techniques can be generalized to design sublinear time algorithms for a broader range of geometric optimization problems with outliers in high dimensions, including MEB with outliers, one-class and two-class linear SVMs with outliers, $k$-center clustering with outliers, and flat fitting with outliers. Our proposed algorithms also work fine for kernels.  ( 2 min )
    A Survey on Transformers in Reinforcement Learning. (arXiv:2301.03044v1 [cs.LG])
    Transformer has been considered the dominating neural architecture in NLP and CV, mostly under a supervised setting. Recently, a similar surge of using Transformers has appeared in the domain of reinforcement learning (RL), but it is faced with unique design choices and challenges brought by the nature of RL. However, the evolution of Transformers in RL has not yet been well unraveled. Hence, in this paper, we seek to systematically review motivations and progress on using Transformers in RL, provide a taxonomy on existing works, discuss each sub-field, and summarize future prospects.
    Self-Supervised Time-to-Event Modeling with Structured Medical Records. (arXiv:2301.03150v1 [cs.LG])
    Time-to-event models (also known as survival models) are used in medicine and other fields for estimating the probability distribution of the time until a particular event occurs. While providing many advantages over traditional classification models, such as naturally handling censoring, time-to-event models require more parameters and are challenging to learn in settings with limited labeled training data. High censoring rates, common in events with long time horizons, further limit available training data and exacerbate the risk of overfitting. Existing methods, such as proportional hazard or accelerated failure time-based approaches, employ distributional assumptions to reduce parameter size, but they are vulnerable to model misspecification. In this work, we address these challenges with MOTOR, a self-supervised model that leverages temporal structure found in large-scale collections of timestamped, but largely unlabeled events, typical of electronic health record data. MOTOR defines a time-to-event pretraining task that naturally captures the probability distribution of event times, making it well-suited to applications in medicine. After pretraining on 8,192 tasks auto-generated from 2.7M patients (2.4B clinical events), we evaluate the performance of our pretrained model after fine-tuning to unseen time-to-event tasks. MOTOR-derived models improve upon current state-of-the-art C statistic performance by 6.6% and decrease training time (in wall time) by up to 8.2 times. We further improve sample efficiency, with adapted models matching current state-of-the-art performance using 95% less training data.
    AI2: The next leap toward native language based and explainable machine learning framework. (arXiv:2301.03391v1 [cs.LG])
    The machine learning frameworks flourished in the last decades, allowing artificial intelligence to get out of academic circles to be applied to enterprise domains. This field has significantly advanced, but there is still some meaningful improvement to reach the subsequent expectations. The proposed framework, named AI$^{2}$, uses a natural language interface that allows a non-specialist to benefit from machine learning algorithms without necessarily knowing how to program with a programming language. The primary contribution of the AI$^{2}$ framework allows a user to call the machine learning algorithms in English, making its interface usage easier. The second contribution is greenhouse gas (GHG) awareness. It has some strategies to evaluate the GHG generated by the algorithm to be called and to propose alternatives to find a solution without executing the energy-intensive algorithm. Another contribution is a preprocessing module that helps to describe and to load data properly. Using an English text-based chatbot, this module guides the user to define every dataset so that it can be described, normalized, loaded and divided appropriately. The last contribution of this paper is about explainability. For decades, the scientific community has known that machine learning algorithms imply the famous black-box problem. Traditional machine learning methods convert an input into an output without being able to justify this result. The proposed framework explains the algorithm's process with the proper texts, graphics and tables. The results, declined in five cases, present usage applications from the user's English command to the explained output. Ultimately, the AI$^{2}$ framework represents the next leap toward native language-based, human-oriented concerns about machine learning framework.
    Fair Multi-Exit Framework for Facial Attribute Classification. (arXiv:2301.02989v1 [cs.CV])
    Fairness has become increasingly pivotal in facial recognition. Without bias mitigation, deploying unfair AI would harm the interest of the underprivileged population. In this paper, we observe that though the higher accuracy that features from the deeper layer of a neural networks generally offer, fairness conditions deteriorate as we extract features from deeper layers. This phenomenon motivates us to extend the concept of multi-exit framework. Unlike existing works mainly focusing on accuracy, our multi-exit framework is fairness-oriented, where the internal classifiers are trained to be more accurate and fairer. During inference, any instance with high confidence from an internal classifier is allowed to exit early. Moreover, our framework can be applied to most existing fairness-aware frameworks. Experiment results show that the proposed framework can largely improve the fairness condition over the state-of-the-art in CelebA and UTK Face datasets.
    Chatbots As Fluent Polyglots: Revisiting Breakthrough Code Snippets. (arXiv:2301.03373v1 [cs.LG])
    The research applies AI-driven code assistants to analyze a selection of influential computer code that has shaped modern technology, including email, internet browsing, robotics, and malicious software. The original contribution of this study was to examine half of the most significant code advances in the last 50 years and, in some cases, to provide notable improvements in clarity or performance. The AI-driven code assistant could provide insights into obfuscated code or software lacking explanatory commentary in all cases examined. We generated additional sample problems based on bug corrections and code optimizations requiring much deeper reasoning than a traditional Google search might provide. Future work focuses on adding automated documentation and code commentary and translating select large code bases into more modern versions with multiple new application programming interfaces (APIs) and chained multi-tasks. The AI-driven code assistant offers a valuable tool for software engineering, particularly in its ability to provide human-level expertise and assist in refactoring legacy code or simplifying the explanation or functionality of high-value repositories.
    CaSpeR: Latent Spectral Regularization for Continual Learning. (arXiv:2301.03345v1 [cs.LG])
    While biological intelligence grows organically as new knowledge is gathered throughout life, Artificial Neural Networks forget catastrophically whenever they face a changing training data distribution. Rehearsal-based Continual Learning (CL) approaches have been established as a versatile and reliable solution to overcome this limitation; however, sudden input disruptions and memory constraints are known to alter the consistency of their predictions. We study this phenomenon by investigating the geometric characteristics of the learner's latent space and find that replayed data points of different classes increasingly mix up, interfering with classification. Hence, we propose a geometric regularizer that enforces weak requirements on the Laplacian spectrum of the latent space, promoting a partitioning behavior. We show that our proposal, called Continual Spectral Regularizer (CaSpeR), can be easily combined with any rehearsal-based CL approach and improves the performance of SOTA methods on standard benchmarks. Finally, we conduct additional analysis to provide insights into CaSpeR's effects and applicability.
    Neighbor Auto-Grouping Graph Neural Networks for Handover Parameter Configuration in Cellular Network. (arXiv:2301.03412v1 [cs.NI])
    The mobile communication enabled by cellular networks is the one of the main foundations of our modern society. Optimizing the performance of cellular networks and providing massive connectivity with improved coverage and user experience has a considerable social and economic impact on our daily life. This performance relies heavily on the configuration of the network parameters. However, with the massive increase in both the size and complexity of cellular networks, network management, especially parameter configuration, is becoming complicated. The current practice, which relies largely on experts' prior knowledge, is not adequate and will require lots of domain experts and high maintenance costs. In this work, we propose a learning-based framework for handover parameter configuration. The key challenge, in this case, is to tackle the complicated dependencies between neighboring cells and jointly optimize the whole network. Our framework addresses this challenge in two ways. First, we introduce a novel approach to imitate how the network responds to different network states and parameter values, called auto-grouping graph convolutional network (AG-GCN). During the parameter configuration stage, instead of solving the global optimization problem, we design a local multi-objective optimization strategy where each cell considers several local performance metrics to balance its own performance and its neighbors. We evaluate our proposed algorithm via a simulator constructed using real network data. We demonstrate that the handover parameters our model can find, achieve better average network throughput compared to those recommended by experts as well as alternative baselines, which can bring better network quality and stability. It has the potential to massively reduce costs arising from human expert intervention and maintenance.
    Towards an AI-enabled Connected Industry: AGV Communication and Sensor Measurement Datasets. (arXiv:2301.03364v1 [cs.NI])
    This paper presents two wireless measurement campaigns in industrial testbeds: industrial Vehicle-to-vehicle (iV2V) and industrial Vehicle-to-infrastructure plus Sensor (iV2I+). Detailed information about the two captured datasets is provided as well. iV2V covers sidelink communication scenarios between Automated Guided Vehicles (AGVs), while iV2I+ is conducted at an industrial setting where an autonomous cleaning robot is connected to a private cellular network. The combination of different communication technologies, together with a common measurement methodology, provides insights that can be exploited by Machine Learning (ML) for tasks such as fingerprinting, line-of-sight detection, prediction of quality of service or link selection. Moreover, the datasets are labelled and pre-filtered for fast on-boarding and applicability. The corresponding testbeds and measurements are also presented in detail for both datasets.
    Fully Dynamic Online Selection through Online Contention Resolution Schemes. (arXiv:2301.03099v1 [cs.AI])
    We study fully dynamic online selection problems in an adversarial/stochastic setting that includes Bayesian online selection, prophet inequalities, posted price mechanisms, and stochastic probing problems subject to combinatorial constraints. In the classical ``incremental'' version of the problem, selected elements remain active until the end of the input sequence. On the other hand, in the fully dynamic version of the problem, elements stay active for a limited time interval, and then leave. This models, for example, the online matching of tasks to workers with task/worker-dependent working times, and sequential posted pricing of perishable goods. A successful approach to online selection problems in the adversarial setting is given by the notion of Online Contention Resolution Scheme (OCRS), that uses a priori information to formulate a linear relaxation of the underlying optimization problem, whose optimal fractional solution is rounded online for any adversarial order of the input sequence. Our main contribution is providing a general method for constructing an OCRS for fully dynamic online selection problems. Then, we show how to employ such OCRS to construct no-regret algorithms in a partial information model with semi-bandit feedback and adversarial inputs.
    Emotion Recognition from Microblog Managing Emoticon with Text and Classifying using 1D CNN. (arXiv:2301.02971v1 [cs.LG])
    Microblog, an online-based broadcast medium, is a widely used forum for people to share their thoughts and opinions. Recently, Emotion Recognition (ER) from microblogs is an inspiring research topic in diverse areas. In the machine learning domain, automatic emotion recognition from microblogs is a challenging task, especially, for better outcomes considering diverse content. Emoticon becomes very common in the text of microblogs as it reinforces the meaning of content. This study proposes an emotion recognition scheme considering both the texts and emoticons from microblog data. Emoticons are considered unique expressions of the users' emotions and can be changed by the proper emotional words. The succession of emoticons appearing in the microblog data is preserved and a 1D Convolutional Neural Network (CNN) is employed for emotion classification. The experimental result shows that the proposed emotion recognition scheme outperforms the other existing methods while tested on Twitter data.
    Learning Optimal Phase-Shifts of Holographic Metasurface Transceivers. (arXiv:2301.03371v1 [eess.SP])
    Holographic metasurface transceivers (HMT) is an emerging technology for enhancing the coverage and rate of wireless communication systems. However, acquiring accurate channel state information in HMT-assisted wireless communication systems is critical for achieving these goals. In this paper, we propose an algorithm for learning the optimal phase-shifts at a HMT for the far-field channel model. Our proposed algorithm exploits the structure of the channel gains in the far-field regions and learns the optimal phase-shifts in presence of noise in the received signals. We prove that the probability that the optimal phase-shifts estimated by our proposed algorithm deviate from the true values decays exponentially in the number of pilot signals. Extensive numerical simulations validate the theoretical guarantees and also demonstrate significant gains as compared to the state-of-the-art policies.
    Machine Learning for Large-Scale Optimization in 6G Wireless Networks. (arXiv:2301.03377v1 [eess.SP])
    The sixth generation (6G) wireless systems are envisioned to enable the paradigm shift from "connected things" to "connected intelligence", featured by ultra high density, large-scale, dynamic heterogeneity, diversified functional requirements and machine learning capabilities, which leads to a growing need for highly efficient intelligent algorithms. The classic optimization-based algorithms usually require highly precise mathematical model of data links and suffer from poor performance with high computational cost in realistic 6G applications. Based on domain knowledge (e.g., optimization models and theoretical tools), machine learning (ML) stands out as a promising and viable methodology for many complex large-scale optimization problems in 6G, due to its superior performance, generalizability, computational efficiency and robustness. In this paper, we systematically review the most representative "learning to optimize" techniques in diverse domains of 6G wireless networks by identifying the inherent feature of the underlying optimization problem and investigating the specifically designed ML frameworks from the perspective of optimization. In particular, we will cover algorithm unrolling, learning to branch-and-bound, graph neural network for structured optimization, deep reinforcement learning for stochastic optimization, end-to-end learning for semantic optimization, as well as federated learning for distributed optimization, for solving challenging large-scale optimization problems arising from various important wireless applications. Through the in-depth discussion, we shed light on the excellent performance of ML-based optimization algorithms with respect to the classical methods, and provide insightful guidance to develop advanced ML techniques in 6G networks.
    A comprehensive review of automatic text summarization techniques: method, data, evaluation and coding. (arXiv:2301.03403v1 [cs.CL])
    We provide a literature review about Automatic Text Summarization (ATS) systems. We consider a citation-based approach. We start with some popular and well-known papers that we have in hand about each topic we want to cover and we have tracked the "backward citations" (papers that are cited by the set of papers we knew beforehand) and the "forward citations" (newer papers that cite the set of papers we knew beforehand). In order to organize the different methods, we present the diverse approaches to ATS guided by the mechanisms they use to generate a summary. Besides presenting the methods, we also present an extensive review of the datasets available for summarization tasks and the methods used to evaluate the quality of the summaries. Finally, we present an empirical exploration of these methods using the CNN Corpus dataset that provides golden summaries for extractive and abstractive methods.
    Unsupervised ensemble-based phenotyping helps enhance the discoverability of genes related to heart morphology. (arXiv:2301.02916v1 [q-bio.GN])
    Recent genome-wide association studies (GWAS) have been successful in identifying associations between genetic variants and simple cardiac parameters derived from cardiac magnetic resonance (CMR) images. However, the emergence of big databases including genetic data linked to CMR, facilitates investigation of more nuanced patterns of shape variability. Here, we propose a new framework for gene discovery entitled Unsupervised Phenotype Ensembles (UPE). UPE builds a redundant yet highly expressive representation by pooling a set of phenotypes learned in an unsupervised manner, using deep learning models trained with different hyperparameters. These phenotypes are then analyzed via (GWAS), retaining only highly confident and stable associations across the ensemble. We apply our approach to the UK Biobank database to extract left-ventricular (LV) geometric features from image-derived three-dimensional meshes. We demonstrate that our approach greatly improves the discoverability of genes influencing LV shape, identifying 11 loci with study-wide significance and 8 with suggestive significance. We argue that our approach would enable more extensive discovery of gene associations with image-derived phenotypes for other organs or image modalities.
    Finding Lookalike Customers for E-Commerce Marketing. (arXiv:2301.03147v1 [cs.LG])
    Customer-centric marketing campaigns generate a large portion of e-commerce website traffic for Walmart. As the scale of customer data grows larger, expanding the marketing audience to reach more customers is becoming more critical for e-commerce companies to drive business growth and bring more value to customers. In this paper, we present a scalable and efficient system to expand targeted audience of marketing campaigns, which can handle hundreds of millions of customers. We use a deep learning based embedding model to represent customers and an approximate nearest neighbor search method to quickly find lookalike customers of interest. The model can deal with various business interests by constructing interpretable and meaningful customer similarity metrics. We conduct extensive experiments to demonstrate the great performance of our system and customer embedding model.
    "It's a Match!" -- A Benchmark of Task Affinity Scores for Joint Learning. (arXiv:2301.02873v1 [cs.LG])
    While the promises of Multi-Task Learning (MTL) are attractive, characterizing the conditions of its success is still an open problem in Deep Learning. Some tasks may benefit from being learned together while others may be detrimental to one another. From a task perspective, grouping cooperative tasks while separating competing tasks is paramount to reap the benefits of MTL, i.e., reducing training and inference costs. Therefore, estimating task affinity for joint learning is a key endeavor. Recent work suggests that the training conditions themselves have a significant impact on the outcomes of MTL. Yet, the literature is lacking of a benchmark to assess the effectiveness of tasks affinity estimation techniques and their relation with actual MTL performance. In this paper, we take a first step in recovering this gap by (i) defining a set of affinity scores by both revisiting contributions from previous literature as well presenting new ones and (ii) benchmarking them on the Taskonomy dataset. Our empirical campaign reveals how, even in a small-scale scenario, task affinity scoring does not correlate well with actual MTL performance. Yet, some metrics can be more indicative than others.  ( 2 min )
    Attention-LSTM for Multivariate Traffic State Prediction on Rural Roads. (arXiv:2301.02731v1 [cs.LG])
    Accurate traffic volume and speed prediction have a wide range of applications in transportation. It can result in useful and timely information for both travellers and transportation decision-makers. In this study, an Attention based Long Sort-Term Memory model (A-LSTM) is proposed to simultaneously predict traffic volume and speed in a critical rural road segmentation which connects Tehran to Chalus, the most tourist destination city in Iran. Moreover, this study compares the results of the A-LSTM model with the Long Short-Term Memory (LSTM) model. Both models show acceptable performance in predicting speed and flow. However, the A-LSTM model outperforms the LSTM in 5 and 15-minute intervals. In contrast, there is no meaningful difference between the two models for the 30-minute time interval. By comparing the performance of the models based on different time horizons, the 15-minute horizon model outperforms the others by reaching the lowest Mean Square Error (MSE) loss of 0.0032, followed by the 30 and 5-minutes horizons with 0.004 and 0.0051, respectively. In addition, this study compares the results of the models based on two transformations of temporal categorical input variables, one-hot or cyclic, for the 15-minute time interval. The results demonstrate that both LSTM and A-LSTM with cyclic feature encoding outperform those with one-hot feature encoding.  ( 2 min )
    Optimistic Meta-Gradients. (arXiv:2301.03236v1 [cs.LG])
    We study the connection between gradient-based meta-learning and convex op-timisation. We observe that gradient descent with momentum is a special case of meta-gradients, and building on recent results in optimisation, we prove convergence rates for meta-learning in the single task setting. While a meta-learned update rule can yield faster convergence up to constant factor, it is not sufficient for acceleration. Instead, some form of optimism is required. We show that optimism in meta-learning can be captured through Bootstrapped Meta-Gradients (Flennerhag et al., 2022), providing deeper insight into its underlying mechanics.
    Deep Learning in Random Neural Fields: Numerical Experiments via Neural Tangent Kernel. (arXiv:2202.05254v2 [cs.LG] UPDATED)
    A biological neural network in the cortex forms a neural field. Neurons in the field have their own receptive fields, and connection weights between two neurons are random but highly correlated when they are in close proximity in receptive fields. In this paper, we investigate such neural fields in a multilayer architecture to investigate the supervised learning of the fields. We empirically compare the performances of our field model with those of randomly connected deep networks. The behavior of a randomly connected network is investigated on the basis of the key idea of the neural tangent kernel regime, a recent development in the machine learning theory of over-parameterized networks; for most randomly connected neural networks, it is shown that global minima always exist in their small neighborhoods. We numerically show that this claim also holds for our neural fields. In more detail, our model has two structures: i) each neuron in a field has a continuously distributed receptive field, and ii) the initial connection weights are random but not independent, having correlations when the positions of neurons are close in each layer. We show that such a multilayer neural field is more robust than conventional models when input patterns are deformed by noise disturbances. Moreover, its generalization ability can be slightly superior to that of conventional models.
    Introducing Model Inversion Attacks on Automatic Speaker Recognition. (arXiv:2301.03206v1 [cs.SD])
    Model inversion (MI) attacks allow to reconstruct average per-class representations of a machine learning (ML) model's training data. It has been shown that in scenarios where each class corresponds to a different individual, such as face classifiers, this represents a severe privacy risk. In this work, we explore a new application for MI: the extraction of speakers' voices from a speaker recognition system. We present an approach to (1) reconstruct audio samples from a trained ML model and (2) extract intermediate voice feature representations which provide valuable insights into the speakers' biometrics. Therefore, we propose an extension of MI attacks which we call sliding model inversion. Our sliding MI extends standard MI by iteratively inverting overlapping chunks of the audio samples and thereby leveraging the sequential properties of audio data for enhanced inversion performance. We show that one can use the inverted audio data to generate spoofed audio samples to impersonate a speaker, and execute voice-protected commands for highly secured systems on their behalf. To the best of our knowledge, our work is the first one extending MI attacks to audio data, and our results highlight the security risks resulting from the extraction of the biometric data in that setup.
    Randomized Greedy Algorithms and Composable Coreset for k-Center Clustering with Outliers. (arXiv:2301.02814v1 [cs.LG])
    In this paper, we study the problem of {\em $k$-center clustering with outliers}. The problem has many important applications in real world, but the presence of outliers can significantly increase the computational complexity. Though a number of methods have been developed in the past decades, it is still quite challenging to design quality guaranteed algorithm with low complexity for this problem. Our idea is inspired by the greedy method, Gonzalez's algorithm, that was developed for solving the ordinary $k$-center clustering problem. Based on some novel observations, we show that a simple randomized version of this greedy strategy actually can handle outliers efficiently. We further show that this randomized greedy approach also yields small coreset for the problem in doubling metrics (even if the doubling dimension is not given), which can greatly reduce the computational complexity. Moreover, together with the partial clustering framework proposed in arXiv:1703.01539 , we prove that our coreset method can be applied to distributed data with a low communication complexity. The experimental results suggest that our algorithms can achieve near optimal solutions and yield lower complexities comparing with the existing methods.  ( 2 min )
    Markov Chain Concentration with an Application in Reinforcement Learning. (arXiv:2301.02926v1 [cs.LG])
    Given $X_1,\cdot ,X_N$ random variables whose joint distribution is given as $\mu$ we will use the Martingale Method to show any Lipshitz Function $f$ over these random variables is subgaussian. The Variance parameter however can have a simple expression under certain conditions. For example under the assumption that the random variables follow a Markov Chain and that the function is Lipschitz under a Weighted Hamming Metric. We shall conclude with certain well known techniques from concentration of suprema of random processes with applications in Reinforcement Learning
    Minimax Weight Learning for Absorbing MDPs. (arXiv:2301.03183v1 [cs.LG])
    Reinforcement learning policy evaluation problems are often modeled as finite or discounted/averaged infinite-horizon MDPs. In this paper, we study undiscounted off-policy policy evaluation for absorbing MDPs. Given the dataset consisting of the i.i.d episodes with a given truncation level, we propose a so-called MWLA algorithm to directly estimate the expected return via the importance ratio of the state-action occupancy measure. The Mean Square Error (MSE) bound for the MWLA method is investigated and the dependence of statistical errors on the data size and the truncation level are analyzed. With an episodic taxi environment, computational experiments illustrate the performance of the MWLA algorithm.
    How to Allocate your Label Budget? Choosing between Active Learning and Learning to Reject in Anomaly Detection. (arXiv:2301.02909v1 [cs.LG])
    Anomaly detection attempts at finding examples that deviate from the expected behaviour. Usually, anomaly detection is tackled from an unsupervised perspective because anomalous labels are rare and difficult to acquire. However, the lack of labels makes the anomaly detector have high uncertainty in some regions, which usually results in poor predictive performance or low user trust in the predictions. One can reduce such uncertainty by collecting specific labels using Active Learning (AL), which targets examples close to the detector's decision boundary. Alternatively, one can increase the user trust by allowing the detector to abstain from making highly uncertain predictions, which is called Learning to Reject (LR). One way to do this is by thresholding the detector's uncertainty based on where its performance is low, which requires labels to be evaluated. Although both AL and LR need labels, they work with different types of labels: AL seeks strategic labels, which are evidently biased, while LR requires i.i.d. labels to evaluate the detector's performance and set the rejection threshold. Because one usually has a unique label budget, deciding how to optimally allocate it is challenging. In this paper, we propose a mixed strategy that, given a budget of labels, decides in multiple rounds whether to use the budget to collect AL labels or LR labels. The strategy is based on a reward function that measures the expected gain when allocating the budget to either side. We evaluate our strategy on 18 benchmark datasets and compare it to some baselines.  ( 2 min )
    Active Deep Learning Guided by Efficient Gaussian Process Surrogates. (arXiv:2301.02761v1 [cs.LG])
    The success of active learning relies on the exploration of the underlying data-generating distributions, populating sparsely labeled data areas, and exploitation of the information about the task gained by the baseline (neural network) learners. In this paper, we present a new algorithm that combines these two active learning modes. Our algorithm adopts a Bayesian surrogate for the baseline learner, and it optimizes the exploration process by maximizing the gain of information caused by new labels. Further, by instantly updating the surrogate learner for each new data instance, our model can faithfully simulate and exploit the continuous learning behavior of the learner without having to actually retrain it per label. In experiments with four benchmark classification datasets, our method demonstrated significant performance gain over state-of-the-arts.
    A Characterization of Multilabel Learnability. (arXiv:2301.02729v1 [cs.LG])
    We consider the problem of multilabel classification and investigate learnability in batch and online settings. In both settings, we show that a multilabel function class is learnable if and only if each single-label restriction of the function class is learnable. As extensions, we also study multioutput regression in the batch setting and bandit feedback in the online setting. For the former, we characterize learnability w.r.t. $L_p$ losses. For the latter, we show a similar characterization as in the full-feedback setting.
    Faithful and Consistent Graph Neural Network Explanations with Rationale Alignment. (arXiv:2301.02791v1 [cs.LG])
    Uncovering rationales behind predictions of graph neural networks (GNNs) has received increasing attention over recent years. Instance-level GNN explanation aims to discover critical input elements, like nodes or edges, that the target GNN relies upon for making predictions. %These identified sub-structures can provide interpretations of GNN's behavior. Though various algorithms are proposed, most of them formalize this task by searching the minimal subgraph which can preserve original predictions. However, an inductive bias is deep-rooted in this framework: several subgraphs can result in the same or similar outputs as the original graphs. Consequently, they have the danger of providing spurious explanations and failing to provide consistent explanations. Applying them to explain weakly-performed GNNs would further amplify these issues. To address this problem, we theoretically examine the predictions of GNNs from the causality perspective. Two typical reasons for spurious explanations are identified: confounding effect of latent variables like distribution shift, and causal factors distinct from the original input. Observing that both confounding effects and diverse causal rationales are encoded in internal representations, \tianxiang{we propose a new explanation framework with an auxiliary alignment loss, which is theoretically proven to be optimizing a more faithful explanation objective intrinsically. Concretely for this alignment loss, a set of different perspectives are explored: anchor-based alignment, distributional alignment based on Gaussian mixture models, mutual-information-based alignment, etc. A comprehensive study is conducted both on the effectiveness of this new framework in terms of explanation faithfulness/consistency and on the advantages of these variants.
    Generative Time Series Forecasting with Diffusion, Denoise, and Disentanglement. (arXiv:2301.03028v1 [cs.LG])
    Time series forecasting has been a widely explored task of great importance in many applications. However, it is common that real-world time series data are recorded in a short time period, which results in a big gap between the deep model and the limited and noisy time series. In this work, we propose to address the time series forecasting problem with generative modeling and propose a bidirectional variational auto-encoder (BVAE) equipped with diffusion, denoise, and disentanglement, namely D3VAE. Specifically, a coupled diffusion probabilistic model is proposed to augment the time series data without increasing the aleatoric uncertainty and implement a more tractable inference process with BVAE. To ensure the generated series move toward the true target, we further propose to adapt and integrate the multiscale denoising score matching into the diffusion process for time series forecasting. In addition, to enhance the interpretability and stability of the prediction, we treat the latent variable in a multivariate manner and disentangle them on top of minimizing total correlation. Extensive experiments on synthetic and real-world data show that D3VAE outperforms competitive algorithms with remarkable margins. Our implementation is available at https://github.com/PaddlePaddle/PaddleSpatial/tree/main/research/D3VAE.
    Prognosis and Treatment Prediction of Type-2 Diabetes Using Deep Neural Network and Machine Learning Classifiers. (arXiv:2301.03093v1 [cs.LG])
    Type 2 Diabetes is a fast-growing, chronic metabolic disorder due to imbalanced insulin activity.The motion of this research is a comparative study of seven machine learning classifiers and an artificial neural network method to prognosticate the detection and treatment of diabetes with high accuracy,in order to identify and treat diabetes patients at an early age.Our training and test dataset is an accumulation of 9483 diabetes patients information.The training dataset is large enough to negate overfitting and provide for highly accurate test performance.We use performance measures such as accuracy and precision to find out the best algorithm deep ANN which outperforms with 95.14% accuracy among all other tested machine learning classifiers.We hope our high-performing model can be used by hospitals to predict diabetes and drive research into more accurate prediction models.
    REaaS: Enabling Adversarially Robust Downstream Classifiers via Robust Encoder as a Service. (arXiv:2301.02905v1 [cs.CR])
    Encoder as a service is an emerging cloud service. Specifically, a service provider first pre-trains an encoder (i.e., a general-purpose feature extractor) via either supervised learning or self-supervised learning and then deploys it as a cloud service API. A client queries the cloud service API to obtain feature vectors for its training/testing inputs when training/testing its classifier (called downstream classifier). A downstream classifier is vulnerable to adversarial examples, which are testing inputs with carefully crafted perturbation that the downstream classifier misclassifies. Therefore, in safety and security critical applications, a client aims to build a robust downstream classifier and certify its robustness guarantees against adversarial examples. What APIs should the cloud service provide, such that a client can use any certification method to certify the robustness of its downstream classifier against adversarial examples while minimizing the number of queries to the APIs? How can a service provider pre-train an encoder such that clients can build more certifiably robust downstream classifiers? We aim to answer the two questions in this work. For the first question, we show that the cloud service only needs to provide two APIs, which we carefully design, to enable a client to certify the robustness of its downstream classifier with a minimal number of queries to the APIs. For the second question, we show that an encoder pre-trained using a spectral-norm regularization term enables clients to build more robust downstream classifiers.
    Transfer learning for non-intrusive load monitoring and appliance identification in a smart home. (arXiv:2301.03018v1 [eess.SP])
    Non-intrusive load monitoring (NILM) or energy disaggregation is an inverse problem whereby the goal is to extract the load profiles of individual appliances, given an aggregate load profile of the mains of a home. NILM could help identify the power usage patterns of individual appliances in a home, and thus, could help realize novel energy conservation schemes for smart homes. In this backdrop, this work proposes a novel deep-learning approach to solve the NILM problem and a few related problems as follows. 1) We build upon the reputed seq2-point convolutional neural network (CNN) model to come up with the proposed seq2-[3]-point CNN model to solve the (home) NILM problem and site-NILM problem (basically, NILM at a smaller scale). 2) We solve the related problem of appliance identification by building upon the state-of-the-art (pre-trained) 2D-CNN models, i.e., AlexNet, ResNet-18, and DenseNet-121, which are trained upon two custom datasets that consist of Wavelets and short-time Fourier transform (STFT)-based 2D electrical signatures of the appliances. 3) Finally, we do some basic qualitative inference about an individual appliance's health by comparing the power consumption of the same appliance across multiple homes. Low-frequency REDD dataset is used to train and test the proposed deep learning models for all problems, except site-NILM where REFIT dataset has been used. As for the results, we achieve a maximum accuracy of 94.6\% for home-NILM, 81\% for site-NILM, and 88.9\% for appliance identification (with Resnet-based model).
    k-Means SubClustering: A Differentially Private Algorithm with Improved Clustering Quality. (arXiv:2301.02896v1 [cs.LG])
    In today's data-driven world, the sensitivity of information has been a significant concern. With this data and additional information on the person's background, one can easily infer an individual's private data. Many differentially private iterative algorithms have been proposed in interactive settings to protect an individual's privacy from these inference attacks. The existing approaches adapt the method to compute differentially private(DP) centroids by iterative Llyod's algorithm and perturbing the centroid with various DP mechanisms. These DP mechanisms do not guarantee convergence of differentially private iterative algorithms and degrade the quality of the cluster. Thus, in this work, we further extend the previous work on 'Differentially Private k-Means Clustering With Convergence Guarantee' by taking it as our baseline. The novelty of our approach is to sub-cluster the clusters and then select the centroid which has a higher probability of moving in the direction of the future centroid. At every Lloyd's step, the centroids are injected with the noise using the exponential DP mechanism. The results of the experiments indicate that our approach outperforms the current state-of-the-art method, i.e., the baseline algorithm, in terms of clustering quality while maintaining the same differential privacy requirements. The clustering quality significantly improved by 4.13 and 2.83 times than baseline for the Wine and Breast_Cancer dataset, respectively.
    Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching. (arXiv:2301.02903v1 [cs.LG])
    Despite surprising performance on zero-shot transfer, pre-training a large-scale multimodal model is often prohibitive as it requires a huge amount of data and computing resources. In this paper, we propose a method (BeamCLIP) that can effectively transfer the representations of a large pre-trained multimodal model (CLIP-ViT) into a small target model (e.g., ResNet-18). For unsupervised transfer, we introduce cross-modal similarity matching (CSM) that enables a student model to learn the representations of a teacher model by matching the relative similarity distribution across text prompt embeddings. To better encode the text prompts, we design context-based prompt augmentation (CPA) that can alleviate the lexical ambiguity of input text prompts. Our experiments show that unsupervised representation transfer of a pre-trained vision-language model enables a small ResNet-18 to achieve a better ImageNet-1K top-1 linear probe accuracy (66.2%) than vision-only self-supervised learning (SSL) methods (e.g., SimCLR: 51.8%, SwAV: 63.7%), while closing the gap with supervised learning (69.8%).
    Traditional Readability Formulas Compared for English. (arXiv:2301.02975v1 [cs.CL])
    Traditional English readability formulas, or equations, were largely developed in the 20th century. Nonetheless, many researchers still rely on them for various NLP applications. Such a phenomenon is presumably due to the convenience and straightforwardness of readability formulas. In this work, we contribute to the NLP community by 1. introducing New English Readability Formula (NERF), 2. recalibrating the coefficients of old readability formulas (Flesch-Kincaid Grade Level, Fog Index, SMOG Index, Coleman-Liau Index, and Automated Readability Index), 3. evaluating the readability formulas, for use in text simplification studies and medical texts, and 4. developing a Python-based program for the wide application to various NLP projects.
    Online Centralized Non-parametric Change-point Detection via Graph-based Likelihood-ratio Estimation. (arXiv:2301.03011v1 [stat.ML])
    Consider each node of a graph to be generating a data stream that is synchronized and observed at near real-time. At a change-point $\tau$, a change occurs at a subset of nodes $C$, which affects the probability distribution of their associated node streams. In this paper, we propose a novel kernel-based method to both detect $\tau$ and localize $C$, based on the direct estimation of the likelihood-ratio between the post-change and the pre-change distributions of the node streams. Our main working hypothesis is the smoothness of the likelihood-ratio estimates over the graph, i.e connected nodes are expected to have similar likelihood-ratios. The quality of the proposed method is demonstrated on extensive experiments on synthetic scenarios.
    Learning the Relation between Similarity Loss and Clustering Loss in Self-Supervised Learning. (arXiv:2301.03041v1 [cs.CV])
    Self-supervised learning enables networks to learn discriminative features from massive data itself. Most state-of-the-art methods maximize the similarity between two augmentations of one image based on contrastive learning. By utilizing the consistency of two augmentations, the burden of manual annotations can be freed. Contrastive learning exploits instance-level information to learn robust features. However, the learned information is probably confined to different views of the same instance. In this paper, we attempt to leverage the similarity between two distinct images to boost representation in self-supervised learning. In contrast to instance-level information, the similarity between two distinct images may provide more useful information. Besides, we analyze the relation between similarity loss and feature-level cross-entropy loss. These two losses are essential for most deep learning methods. However, the relation between these two losses is not clear. Similarity loss helps obtain instance-level representation, while feature-level cross-entropy loss helps mine the similarity between two distinct images. We provide theoretical analyses and experiments to show that a suitable combination of these two losses can get state-of-the-art results.
    Unsupervised Learning for Combinatorial Optimization Needs Meta-Learning. (arXiv:2301.03116v1 [cs.LG])
    A general framework of unsupervised learning for combinatorial optimization (CO) is to train a neural network (NN) whose output gives a problem solution by directly optimizing the CO objective. Albeit with some advantages over traditional solvers, the current framework optimizes an averaged performance over the distribution of historical problem instances, which misaligns with the actual goal of CO that looks for a good solution to every future encountered instance. With this observation, we propose a new objective of unsupervised learning for CO where the goal of learning is to search for good initialization for future problem instances rather than give direct solutions. We propose a meta-learning-based training pipeline for this new objective. Our method achieves go empirical performance. We observe that even just the initial solution given by our model before fine-tuning can significantly outperform the baselines under various evaluation settings including evaluation across multiple datasets, and the case with big shifts in the problem scale. The reason we conjecture is that meta-learning-based training lets the model loosely tied to each local optima for a training instance while being more adaptive to the changes of optimization landscapes across instances.
    Explaining Graph Neural Networks via Non-parametric Subgraph Matching. (arXiv:2301.02780v1 [cs.LG])
    The great success in graph neural networks (GNNs) provokes the question about explainability: Which fraction of the input graph is the most determinant of the prediction? Particularly, parametric explainers prevail in existing approaches because of their stronger capability to decipher the black-box (i.e., the target GNN). In this paper, based on the observation that graphs typically share some joint motif patterns, we propose a novel non-parametric subgraph matching framework, dubbed MatchExplainer, to explore explanatory subgraphs. It couples the target graph with other counterpart instances and identifies the most crucial joint substructure by minimizing the node corresponding-based distance. Moreover, we note that present graph sampling or node-dropping methods usually suffer from the false positive sampling problem. To ameliorate that issue, we design a new augmentation paradigm named MatchDrop. It takes advantage of MatchExplainer to fix the most informative portion of the graph and merely operates graph augmentations on the rest less informative part. We conduct extensive experiments on both synthetic and real-world datasets and show the effectiveness of our MatchExplainer by outperforming all parametric baselines with significant margins. Additional results also demonstrate that our MatchDrop is a general scheme to be equipped with GNNs for enhanced performance.
    The 3D Structural Phenotype of the Glaucomatous Optic Nerve Head and its Relationship with The Severity of Visual Field Damage. (arXiv:2301.02837v1 [cs.LG])
    $\bf{Purpose}$: To describe the 3D structural changes in both connective and neural tissues of the optic nerve head (ONH) that occur concurrently at different stages of glaucoma using traditional and AI-driven approaches. $\bf{Methods}$: We included 213 normal, 204 mild glaucoma (mean deviation [MD] $\ge$ -6.00 dB), 118 moderate glaucoma (MD of -6.01 to -12.00 dB), and 118 advanced glaucoma patients (MD < -12.00 dB). All subjects had their ONHs imaged in 3D with Spectralis optical coherence tomography. To describe the 3D structural phenotype of glaucoma as a function of severity, we used two different approaches: (1) We extracted human-defined 3D structural parameters of the ONH including retinal nerve fiber layer (RNFL) thickness, lamina cribrosa (LC) shape and depth at different stages of glaucoma; (2) we also employed a geometric deep learning method (i.e. PointNet) to identify the most important 3D structural features that differentiate ONHs from different glaucoma severity groups without any human input. $\bf{Results}$: We observed that the majority of ONH structural changes occurred in the early glaucoma stage, followed by a plateau effect in the later stages. Using PointNet, we also found that 3D ONH structural changes were present in both neural and connective tissues. In both approaches, we observed that structural changes were more prominent in the superior and inferior quadrant of the ONH, particularly in the RNFL, the prelamina, and the LC. As the severity of glaucoma increased, these changes became more diffuse (i.e. widespread), particularly in the LC. $\bf{Conclusions}$: In this study, we were able to uncover complex 3D structural changes of the ONH in both neural and connective tissues as a function of glaucoma severity. We hope to provide new insights into the complex pathophysiology of glaucoma that might help clinicians in their daily clinical care.
    Reducing Over-smoothing in Graph Neural Networks Using Relational Embeddings. (arXiv:2301.02924v1 [cs.LG])
    Graph Neural Networks (GNNs) have achieved a lot of success with graph-structured data. However, it is observed that the performance of GNNs does not improve (or even worsen) as the number of layers increases. This effect has known as over-smoothing, which means that the representations of the graph nodes of different classes would become indistinguishable when stacking multiple layers. In this work, we propose a new simple, and efficient method to alleviate the effect of the over-smoothing problem in GNNs by explicitly using relations between node embeddings. Experiments on real-world datasets demonstrate that utilizing node embedding relations makes GNN models such as Graph Attention Network more robust to over-smoothing and achieves better performance with deeper GNNs. Our method can be used in combination with other methods to give the best performance. GNN applications are endless and depend on the user's objective and the type of data that they possess. Solving over-smoothing issues can potentially improve the performance of models on all these tasks.
    AutoAC: Towards Automated Attribute Completion for Heterogeneous Graph Neural Network. (arXiv:2301.03049v1 [cs.LG])
    Many real-world data can be modeled as heterogeneous graphs that contain multiple types of nodes and edges. Meanwhile, due to excellent performance, heterogeneous graph neural networks (GNNs) have received more and more attention. However, the existing work mainly focuses on the design of novel GNN models, while ignoring another important issue that also has a large impact on the model performance, namely the missing attributes of some node types. The handcrafted attribute completion requires huge expert experience and domain knowledge. Also, considering the differences in semantic characteristics between nodes, the attribute completion should be fine-grained, i.e., the attribute completion operation should be node-specific. Moreover, to improve the performance of the downstream graph learning task, attribute completion and the training of the heterogeneous GNN should be jointly optimized rather than viewed as two separate processes. To address the above challenges, we propose a differentiable attribute completion framework called AutoAC for automated completion operation search in heterogeneous GNNs. We first propose an expressive completion operation search space, including topology-dependent and topology-independent completion operations. Then, we propose a continuous relaxation schema and further propose a differentiable completion algorithm where the completion operation search is formulated as a bi-level joint optimization problem. To improve the search efficiency, we leverage two optimization techniques: discrete constraints and auxiliary unsupervised graph node clustering. Extensive experimental results on real-world datasets reveal that AutoAC outperforms the SOTA handcrafted heterogeneous GNNs and the existing attribute completion method
    Subset verification and search algorithms for causal DAGs. (arXiv:2301.03180v1 [cs.LG])
    Learning causal relationships between variables is a fundamental task in causal inference and directed acyclic graphs (DAGs) are a popular choice to represent the causal relationships. As one can recover a causal graph only up to its Markov equivalence class from observations, interventions are often used for the recovery task. Interventions are costly in general and it is important to design algorithms that minimize the number of interventions performed. In this work, we study the problem of learning the causal relationships of a subset of edges (target edges) in a graph with as few interventions as possible. Under the assumptions of faithfulness, causal sufficiency, and ideal interventions, we study this problem in two settings: when the underlying ground truth causal graph is known (subset verification) and when it is unknown (subset search). For the subset verification problem, we provide an efficient algorithm to compute a minimum sized interventional set; we further extend these results to bounded size non-atomic interventions and node-dependent interventional costs. For the subset search problem, in the worst case, we show that no algorithm (even with adaptivity or randomization) can achieve an approximation ratio that is asymptotically better than the vertex cover of the target edges when compared with the subset verification number. This result is surprising as there exists a logarithmic approximation algorithm for the search problem when we wish to recover the whole causal graph. To obtain our results, we prove several interesting structural properties of interventional causal graphs that we believe have applications beyond the subset verification/search problems studied here.
    Advanced Data Augmentation Approaches: A Comprehensive Survey and Future directions. (arXiv:2301.02830v1 [cs.CV])
    Deep learning (DL) algorithms have shown significant performance in various computer vision tasks. However, having limited labelled data lead to a network overfitting problem, where network performance is bad on unseen data as compared to training data. Consequently, it limits performance improvement. To cope with this problem, various techniques have been proposed such as dropout, normalization and advanced data augmentation. Among these, data augmentation, which aims to enlarge the dataset size by including sample diversity, has been a hot topic in recent times. In this article, we focus on advanced data augmentation techniques. we provide a background of data augmentation, a novel and comprehensive taxonomy of reviewed data augmentation techniques, and the strengths and weaknesses (wherever possible) of each technique. We also provide comprehensive results of the data augmentation effect on three popular computer vision tasks, such as image classification, object detection and semantic segmentation. For results reproducibility, we compiled available codes of all data augmentation techniques. Finally, we discuss the challenges and difficulties, and possible future direction for the research community. We believe, this survey provides several benefits i) readers will understand the data augmentation working mechanism to fix overfitting problems ii) results will save the searching time of the researcher for comparison purposes. iii) Codes of the mentioned data augmentation techniques are available at https://github.com/kmr2017/Advanced-Data-augmentation-codes iv) Future work will spark interest in research community.
    SCENE: Reasoning about Traffic Scenes using Heterogeneous Graph Neural Networks. (arXiv:2301.03512v1 [cs.CV])
    Understanding traffic scenes requires considering heterogeneous information about dynamic agents and the static infrastructure. In this work we propose SCENE, a methodology to encode diverse traffic scenes in heterogeneous graphs and to reason about these graphs using a heterogeneous Graph Neural Network encoder and task-specific decoders. The heterogeneous graphs, whose structures are defined by an ontology, consist of different nodes with type-specific node features and different relations with type-specific edge features. In order to exploit all the information given by these graphs, we propose to use cascaded layers of graph convolution. The result is an encoding of the scene. Task-specific decoders can be applied to predict desired attributes of the scene. Extensive evaluation on two diverse binary node classification tasks show the main strength of this methodology: despite being generic, it even manages to outperform task-specific baselines. The further application of our methodology to the task of node classification in various knowledge graphs shows its transferability to other domains.
    Efficient Attack Detection in IoT Devices using Feature Engineering-Less Machine Learning. (arXiv:2301.03532v1 [cs.CR])
    Through the generalization of deep learning, the research community has addressed critical challenges in the network security domain, like malware identification and anomaly detection. However, they have yet to discuss deploying them on Internet of Things (IoT) devices for day-to-day operations. IoT devices are often limited in memory and processing power, rendering the compute-intensive deep learning environment unusable. This research proposes a way to overcome this barrier by bypassing feature engineering in the deep learning pipeline and using raw packet data as input. We introduce a feature engineering-less machine learning (ML) process to perform malware detection on IoT devices. Our proposed model, "Feature engineering-less-ML (FEL-ML)," is a lighter-weight detection algorithm that expends no extra computations on "engineered" features. It effectively accelerates the low-powered IoT edge. It is trained on unprocessed byte-streams of packets. Aside from providing better results, it is quicker than traditional feature-based methods. FEL-ML facilitates resource-sensitive network traffic security with the added benefit of eliminating the significant investment by subject matter experts in feature engineering.
    Active manifolds, stratifications, and convergence to local minima in nonsmooth optimization. (arXiv:2108.11832v2 [math.OC] UPDATED)
    We show that the subgradient method converges only to local minimizers when applied to generic Lipschitz continuous and subdifferentially regular functions that are definable in an o-minimal structure. At a high level, the argument we present is appealingly transparent: we interpret the nonsmooth dynamics as an approximate Riemannian gradient method on a certain distinguished submanifold that captures the nonsmooth activity of the function. In the process, we develop new regularity conditions in nonsmooth analysis that parallel the stratification conditions of Whitney, Kuo, and Verdier and extend stochastic processes techniques of Pemantle.
    Grokking modular arithmetic. (arXiv:2301.02679v1 [cs.LG])
    We present a simple neural network that can learn modular arithmetic tasks and exhibits a sudden jump in generalization known as ``grokking''. Concretely, we present (i) fully-connected two-layer networks that exhibit grokking on various modular arithmetic tasks under vanilla gradient descent with the MSE loss function in the absence of any regularization; (ii) evidence that grokking modular arithmetic corresponds to learning specific feature maps whose structure is determined by the task; (iii) analytic expressions for the weights -- and thus for the feature maps -- that solve a large class of modular arithmetic tasks; and (iv) evidence that these feature maps are also found by vanilla gradient descent as well as AdamW, thereby establishing complete interpretability of the representations learnt by the network.
    Few-shot Node Classification with Extremely Weak Supervision. (arXiv:2301.02708v1 [cs.LG])
    Few-shot node classification aims at classifying nodes with limited labeled nodes as references. Recent few-shot node classification methods typically learn from classes with abundant labeled nodes (i.e., meta-training classes) and then generalize to classes with limited labeled nodes (i.e., meta-test classes). Nevertheless, on real-world graphs, it is usually difficult to obtain abundant labeled nodes for many classes. In practice, each meta-training class can only consist of several labeled nodes, known as the extremely weak supervision problem. In few-shot node classification, with extremely limited labeled nodes for meta-training, the generalization gap between meta-training and meta-test will become larger and thus lead to suboptimal performance. To tackle this issue, we study a novel problem of few-shot node classification with extremely weak supervision and propose a principled framework X-FNC under the prevalent meta-learning framework. Specifically, our goal is to accumulate meta-knowledge across different meta-training tasks with extremely weak supervision and generalize such knowledge to meta-test tasks. To address the challenges resulting from extremely scarce labeled nodes, we propose two essential modules to obtain pseudo-labeled nodes as extra references and effectively learn from extremely limited supervision information. We further conduct extensive experiments on four node classification datasets with extremely weak supervision to validate the superiority of our framework compared to the state-of-the-art baselines.
    Perceptual-Neural-Physical Sound Matching. (arXiv:2301.02886v1 [cs.SD])
    Sound matching algorithms seek to approximate a target waveform by parametric audio synthesis. Deep neural networks have achieved promising results in matching sustained harmonic tones. However, the task is more challenging when targets are nonstationary and inharmonic, e.g., percussion. We attribute this problem to the inadequacy of loss function. On one hand, mean square error in the parametric domain, known as "P-loss", is simple and fast but fails to accommodate the differing perceptual significance of each parameter. On the other hand, mean square error in the spectrotemporal domain, known as "spectral loss", is perceptually motivated and serves in differentiable digital signal processing (DDSP). Yet, spectral loss has more local minima than P-loss and its gradient may be computationally expensive; hence a slow convergence. Against this conundrum, we present Perceptual-Neural-Physical loss (PNP). PNP is the optimal quadratic approximation of spectral loss while being as fast as P-loss during training. We instantiate PNP with physical modeling synthesis as decoder and joint time-frequency scattering transform (JTFS) as spectral representation. We demonstrate its potential on matching synthetic drum sounds in comparison with other loss functions.
    On Consistency and Asymptotic Normality of Least Absolute Deviation Estimators for 2-dimensional Sinusoidal Model. (arXiv:2301.03229v1 [math.ST])
    Estimation of the parameters of a 2-dimensional sinusoidal model is a fundamental problem in digital signal processing. In this paper, we propose a robust least absolute deviation (LAD) estimators for parameter estimation. The proposed methodology provides a robust alternative to non-robust estimation techniques like the least squares estimators, in situations where outliers are present in the data or in the presence of heavy tailed noise. We study important asymptotic properties of the LAD estimators and establish the strong consistency and asymptotic normality of the LAD estimators. We further illustrate the advantage of using LAD estimators over least squares estimators through extensive simulation studies.
    AI Maintenance: A Robustness Perspective. (arXiv:2301.03052v1 [cs.LG])
    With the advancements in machine learning (ML) methods and compute resources, artificial intelligence (AI) empowered systems are becoming a prevailing technology. However, current AI technology such as deep learning is not flawless. The significantly increased model complexity and data scale incur intensified challenges when lacking trustworthiness and transparency, which could create new risks and negative impacts. In this paper, we carve out AI maintenance from the robustness perspective. We start by introducing some highlighted robustness challenges in the AI lifecycle and motivating AI maintenance by making analogies to car maintenance. We then propose an AI model inspection framework to detect and mitigate robustness risks. We also draw inspiration from vehicle autonomy to define the levels of AI robustness automation. Our proposal for AI maintenance facilitates robustness assessment, status tracking, risk scanning, model hardening, and regulation throughout the AI lifecycle, which is an essential milestone toward building sustainable and trustworthy AI ecosystems.
    Why do Nearest Neighbor Language Models Work?. (arXiv:2301.02828v1 [cs.CL])
    Language models (LMs) compute the probability of a text by sequentially computing a representation of an already-seen context and using this representation to predict the next word. Currently, most LMs calculate these representations through a neural network consuming the immediate previous context. However recently, retrieval-augmented LMs have shown to improve over standard neural LMs, by accessing information retrieved from a large datastore, in addition to their standard, parametric, next-word prediction. In this paper, we set out to understand why retrieval-augmented language models, and specifically why k-nearest neighbor language models (kNN-LMs) perform better than standard parametric LMs, even when the k-nearest neighbor component retrieves examples from the same training set that the LM was originally trained on. To this end, we perform a careful analysis of the various dimensions over which kNN-LM diverges from standard LMs, and investigate these dimensions one by one. Empirically, we identify three main reasons why kNN-LM performs better than standard LMs: using a different input representation for predicting the next tokens, approximate kNN search, and the importance of softmax temperature for the kNN distribution. Further, we incorporate these insights into the model architecture or the training procedure of the standard parametric LM, improving its results without the need for an explicit retrieval component. The code is available at https://github.com/frankxu2004/knnlm-why.
    Isotonic Recalibration under a Low Signal-to-Noise Ratio. (arXiv:2301.02692v1 [stat.ME])
    Insurance pricing systems should fulfill the auto-calibration property to ensure that there is no systematic cross-financing between different price cohorts. Often, regression models are not auto-calibrated. We propose to apply isotonic recalibration to a given regression model to ensure auto-calibration. Our main result proves that under a low signal-to-noise ratio, this isotonic recalibration step leads to explainable pricing systems because the resulting isotonically recalibrated regression functions have a low complexity.
    BQ-NCO: Bisimulation Quotienting for Generalizable Neural Combinatorial Optimization. (arXiv:2301.03313v1 [cs.LG])
    Despite the success of Neural Combinatorial Optimization methods for end-to-end heuristic learning, out-of-distribution generalization remains a challenge. In this paper, we present a novel formulation of combinatorial optimization (CO) problems as Markov Decision Processes (MDPs) that effectively leverages symmetries of the CO problems to improve out-of-distribution robustness. Starting from the standard MDP formulation of constructive heuristics, we introduce a generic transformation based on bisimulation quotienting (BQ) in MDPs. This transformation allows to reduce the state space by accounting for the intrinsic symmetries of the CO problem and facilitates the MDP solving. We illustrate our approach on the Traveling Salesman, Capacitated Vehicle Routing and Knapsack Problems. We present a BQ reformulation of these problems and introduce a simple attention-based policy network that we train by imitation of (near) optimal solutions for small instances from a single distribution. We obtain new state-of-the-art generalization results for instances with up to 1000 nodes from synthetic and realistic benchmarks that vary both in size and node distributions.
    Modeling Scattering Coefficients in Antenna Design using Self-Attentive Complex Polynomials with Image-based Representation. (arXiv:2301.02747v1 [cs.LG])
    Finding antenna designs that satisfy frequency requirements and are also optimal with respect to multiple physical criteria is a critical component in designing next generation hardware. However, such a process is non-trivial because the objective function is typically highly nonlinear and sensitive to subtle design change. Moreover, the objective to be optimized often involves electromagnetic (EM) simulations, which is slow and expensive with commercial simulation software. In this work, we propose a sample-efficient and accurate surrogate model, named CZP (Constant Zeros Poles), to directly estimate the scattering coefficients in the frequency domain of a given 2D planar antenna design, without using a simulator. CZP achieves this by predicting the complex zeros and poles for the frequency response of scattering coefficients, which we have theoretically justified for any linear PDE, including Maxwell's equations. Moreover, instead of using low-dimensional representations, CZP leverages a novel image-based representation for antenna topology inspired by the existing mesh-based EM simulation techniques, and attention-based neural network architectures. We demonstrate experimentally that CZP not only outperforms baselines in terms of test loss, but also is able to find 2D antenna designs verifiable by commercial software with only 40k training samples, when coupling with advanced sequential search techniques like reinforcement learning.
    Principal Component Analysis in Space Forms. (arXiv:2301.02750v1 [stat.ML])
    Principal component analysis (PCA) is a workhorse of modern data science. Practitioners typically perform PCA assuming the data conforms to Euclidean geometry. However, for specific data types, such as hierarchical data, other geometrical spaces may be more appropriate. We study PCA in space forms; that is, those with constant positive (spherical) and negative (hyperbolic) curvatures, in addition to zero-curvature (Euclidean) spaces. At any point on a Riemannian manifold, one can define a Riemannian affine subspace based on a set of tangent vectors and use invertible maps to project tangent vectors to the manifold and vice versa. Finding a low-dimensional Riemannian affine subspace for a set of points in a space form amounts to dimensionality reduction because, as we show, any such affine subspace is isometric to a space form of the same dimension and curvature. To find principal components, we seek a (Riemannian) affine subspace that best represents a set of manifold-valued data points with the minimum average cost of projecting data points onto the affine subspace. We propose specific cost functions that bring about two major benefits: (1) the affine subspace can be estimated by solving an eigenequation -- similar to that of Euclidean PCA, and (2) optimal affine subspaces of different dimensions form a nested set. These properties provide advances over existing methods which are mostly iterative algorithms with slow convergence and weaker theoretical guarantees. Specifically for hyperbolic PCA, the associated eigenequation operates in the Lorentzian space, endowed with an indefinite inner product; we thus establish a connection between Lorentzian and Euclidean eigenequations. We evaluate the proposed space form PCA on data sets simulated in spherical and hyperbolic spaces and show that it outperforms alternative methods in convergence speed or accuracy, often both.
    Visual Story Generation Based on Emotion and Keywords. (arXiv:2301.02777v1 [cs.AI])
    Automated visual story generation aims to produce stories with corresponding illustrations that exhibit coherence, progression, and adherence to characters' emotional development. This work proposes a story generation pipeline to co-create visual stories with the users. The pipeline allows the user to control events and emotions on the generated content. The pipeline includes two parts: narrative and image generation. For narrative generation, the system generates the next sentence using user-specified keywords and emotion labels. For image generation, diffusion models are used to create a visually appealing image corresponding to each generated sentence. Further, object recognition is applied to the generated images to allow objects in these images to be mentioned in future story development.
    Discovery of structure-property relations for molecules via hypothesis-driven active learning over the chemical space. (arXiv:2301.02665v1 [cs.LG])
    Discovery of the molecular candidates for applications in drug targets, biomolecular systems, catalysts, photovoltaics, organic electronics, and batteries, necessitates development of machine learning algorithms capable of rapid exploration of the chemical spaces targeting the desired functionalities. Here we introduce a novel approach for the active learning over the chemical spaces based on hypothesis learning. We construct the hypotheses on the possible relationships between structures and functionalities of interest based on a small subset of data and introduce them as (probabilistic) mean functions for the Gaussian process. This approach combines the elements from the symbolic regression methods such as SISSO and active learning into a single framework. Here, we demonstrate it for the QM9 dataset, but it can be applied more broadly to datasets from both domains of molecular and solid-state materials sciences.
    Multimodal Lyrics-Rhythm Matching. (arXiv:2301.02732v1 [cs.SD])
    Despite the recent increase in research on artificial intelligence for music, prominent correlations between key components of lyrics and rhythm such as keywords, stressed syllables, and strong beats are not frequently studied. Ths is likely due to challenges such as audio misalignment, inaccuracies in syllabic identification, and most importantly, the need for cross-disciplinary knowledge. To address this lack of research, we propose a novel multimodal lyrics-rhythm matching approach in this paper that specifically matches key components of lyrics and music with each other without any language limitations. We use audio instead of sheet music with readily available metadata, which creates more challenges yet increases the application flexibility of our method. Furthermore, our approach creatively generates several patterns involving various multimodalities, including music strong beats, lyrical syllables, auditory changes in a singer's pronunciation, and especially lyrical keywords, which are utilized for matching key lyrical elements with key rhythmic elements. This advantageous approach not only provides a unique way to study auditory lyrics-rhythm correlations including efficient rhythm-based audio alignment algorithms, but also bridges computational linguistics with music as well as music cognition. Our experimental results reveal an 0.81 probability of matching on average, and around 30% of the songs have a probability of 0.9 or higher of keywords landing on strong beats, including 12% of the songs with a perfect landing. Also, the similarity metrics are used to evaluate the correlation between lyrics and rhythm. It shows that nearly 50% of the songs have 0.70 similarity or higher. In conclusion, our approach contributes significantly to the lyrics-rhythm relationship by computationally unveiling insightful correlations.
  • Open

    A Characterization of Multilabel Learnability. (arXiv:2301.02729v1 [cs.LG])
    We consider the problem of multilabel classification and investigate learnability in batch and online settings. In both settings, we show that a multilabel function class is learnable if and only if each single-label restriction of the function class is learnable. As extensions, we also study multioutput regression in the batch setting and bandit feedback in the online setting. For the former, we characterize learnability w.r.t. $L_p$ losses. For the latter, we show a similar characterization as in the full-feedback setting.
    Upward lightning at wind turbines: Risk assessment from larger-scale meteorology. (arXiv:2301.03360v1 [stat.ML])
    Upward lightning (UL) has become an increasingly important threat to wind turbines as ever more of them are being installed for renewably producing electricity. The taller the wind turbine the higher the risk that the type of lightning striking the man-made structure is UL. UL can be much more destructive than downward lightning due to its long lasting initial continuous current leading to a large charge transfer within the lightning discharge process. Current standards for the risk assessment of lightning at wind turbines mainly take the summer lightning activity into account, which is inferred from LLS. Ground truth lightning current measurements reveal that less than 50% of UL might be detected by lightning location systems (LLS). This leads to a large underestimation of the proportion of LLS-non-detectable UL at wind turbines, which is the dominant lightning type in the cold season. This study aims to assess the risk of LLS-detectable and LLS-non-detectable UL at wind turbines using direct UL measurements at the Gaisberg Tower (Austria) and S\"antis Tower (Switzerland). Direct UL observations are linked to meteorological reanalysis data and joined by random forests, a powerful machine learning technique. The meteorological drivers for the non-/occurrence of LLS-detectable and LLS-non-detectable UL, respectively, are found from the random forest models trained at the towers and have large predictive skill on independent data. In a second step the results from the tower-trained models are extended to a larger study domain (Central and Northern Germany). The tower-trained models for LLS-detectable lightning is independently verified at wind turbine locations in that domain and found to reliably diagnose that type of UL. Risk maps based on case study events show that high diagnosed probabilities in the study domain coincide with actual UL events.
    Generalized Kernel Regularized Least Squares. (arXiv:2209.14355v2 [stat.ML] UPDATED)
    Kernel Regularized Least Squares (KRLS) is a popular method for flexibly estimating models that may have complex relationships between variables. However, its usefulness to many researchers is limited for two reasons. First, existing approaches are inflexible and do not allow KRLS to be combined with theoretically-motivated extensions such as random effects, unregularized fixed effects, or non-Gaussian outcomes. Second, estimation is extremely computationally intensive for even modestly sized datasets. Our paper addresses both concerns by introducing generalized KRLS (gKRLS). We note that KRLS can be re-formulated as a hierarchical model thereby allowing easy inference and modular model construction where KRLS can be used alongside random effects, splines, and unregularized fixed effects. Computationally, we also implement random sketching to dramatically accelerate estimation while incurring a limited penalty in estimation quality. We demonstrate that gKRLS can be fit on datasets with tens of thousands of observations in under one minute. Further, state-of-the-art techniques that require fitting the model over a dozen times (e.g. meta-learners) can be estimated quickly.
    Sharper Analysis for Minibatch Stochastic Proximal Point Methods: Stability, Smoothness, and Deviation. (arXiv:2301.03125v1 [stat.ML])
    The stochastic proximal point (SPP) methods have gained recent attention for stochastic optimization, with strong convergence guarantees and superior robustness to the classic stochastic gradient descent (SGD) methods showcased at little to no cost of computational overhead added. In this article, we study a minibatch variant of SPP, namely M-SPP, for solving convex composite risk minimization problems. The core contribution is a set of novel excess risk bounds of M-SPP derived through the lens of algorithmic stability theory. Particularly under smoothness and quadratic growth conditions, we show that M-SPP with minibatch-size $n$ and iteration count $T$ enjoys an in-expectation fast rate of convergence consisting of an $\mathcal{O}\left(\frac{1}{T^2}\right)$ bias decaying term and an $\mathcal{O}\left(\frac{1}{nT}\right)$ variance decaying term. In the small-$n$-large-$T$ setting, this result substantially improves the best known results of SPP-type approaches by revealing the impact of noise level of model on convergence rate. In the complementary small-$T$-large-$n$ regime, we provide a two-phase extension of M-SPP to achieve comparable convergence rates. Moreover, we derive a near-tight high probability (over the randomness of data) bound on the parameter estimation error of a sampling-without-replacement variant of M-SPP. Numerical evidences are provided to support our theoretical predictions when substantialized to Lasso and logistic regression models.
    Beyond calibration: estimating the grouping loss of modern neural networks. (arXiv:2210.16315v2 [cs.LG] UPDATED)
    The ability to ensure that a classifier gives reliable confidence scores is essential to ensure informed decision-making. To this end, recent work has focused on miscalibration, i.e., the over or under confidence of model scores. Yet calibration is not enough: even a perfectly calibrated classifier with the best possible accuracy can have confidence scores that are far from the true posterior probabilities. This is due to the grouping loss, created by samples with the same confidence scores but different true posterior probabilities. Proper scoring rule theory shows that given the calibration loss, the missing piece to characterize individual errors is the grouping loss. While there are many estimators of the calibration loss, none exists for the grouping loss in standard settings. Here, we propose an estimator to approximate the grouping loss. We show that modern neural network architectures in vision and NLP exhibit grouping loss, notably in distribution shifts settings, which highlights the importance of pre-production validation.  ( 2 min )
    Principal Component Analysis in Space Forms. (arXiv:2301.02750v1 [stat.ML])
    Principal component analysis (PCA) is a workhorse of modern data science. Practitioners typically perform PCA assuming the data conforms to Euclidean geometry. However, for specific data types, such as hierarchical data, other geometrical spaces may be more appropriate. We study PCA in space forms; that is, those with constant positive (spherical) and negative (hyperbolic) curvatures, in addition to zero-curvature (Euclidean) spaces. At any point on a Riemannian manifold, one can define a Riemannian affine subspace based on a set of tangent vectors and use invertible maps to project tangent vectors to the manifold and vice versa. Finding a low-dimensional Riemannian affine subspace for a set of points in a space form amounts to dimensionality reduction because, as we show, any such affine subspace is isometric to a space form of the same dimension and curvature. To find principal components, we seek a (Riemannian) affine subspace that best represents a set of manifold-valued data points with the minimum average cost of projecting data points onto the affine subspace. We propose specific cost functions that bring about two major benefits: (1) the affine subspace can be estimated by solving an eigenequation -- similar to that of Euclidean PCA, and (2) optimal affine subspaces of different dimensions form a nested set. These properties provide advances over existing methods which are mostly iterative algorithms with slow convergence and weaker theoretical guarantees. Specifically for hyperbolic PCA, the associated eigenequation operates in the Lorentzian space, endowed with an indefinite inner product; we thus establish a connection between Lorentzian and Euclidean eigenequations. We evaluate the proposed space form PCA on data sets simulated in spherical and hyperbolic spaces and show that it outperforms alternative methods in convergence speed or accuracy, often both.  ( 2 min )
    Isotonic Recalibration under a Low Signal-to-Noise Ratio. (arXiv:2301.02692v1 [stat.ME])
    Insurance pricing systems should fulfill the auto-calibration property to ensure that there is no systematic cross-financing between different price cohorts. Often, regression models are not auto-calibrated. We propose to apply isotonic recalibration to a given regression model to ensure auto-calibration. Our main result proves that under a low signal-to-noise ratio, this isotonic recalibration step leads to explainable pricing systems because the resulting isotonically recalibrated regression functions have a low complexity.  ( 2 min )
    Improved Training of Physics-Informed Neural Networks with Model Ensembles. (arXiv:2204.05108v2 [cs.LG] UPDATED)
    Learning the solution of partial differential equations (PDEs) with a neural network (known in the literature as a physics-informed neural network, PINN) is an attractive alternative to traditional solvers due to its elegancy, greater flexibility and the ease of incorporating observed data. However, training PINNs is notoriously difficult in practice. One problem is the existence of multiple simple (but wrong) solutions which are attractive for PINNs when the solution interval is too large. In this paper, we propose to expand the solution interval gradually to make the PINN converge to the correct solution. To find a good schedule for the solution interval expansion, we train an ensemble of PINNs. The idea is that all ensemble members converge to the same solution in the vicinity of observed data (e.g., initial conditions) while they may be pulled towards different wrong solutions farther away from the observations. Therefore, we use the ensemble agreement as the criterion for including new points for computing the loss derived from PDEs. We show experimentally that the proposed method can improve the accuracy of the found solution.  ( 2 min )
    Efficient Approximation of Gromov-Wasserstein Distance Using Importance Sparsification. (arXiv:2205.13573v3 [cs.LG] UPDATED)
    As a valid metric of metric-measure spaces, Gromov-Wasserstein (GW) distance has shown the potential for matching problems of structured data like point clouds and graphs. However, its application in practice is limited due to the high computational complexity. To overcome this challenge, we propose a novel importance sparsification method, called \textsc{Spar-GW}, to approximate GW distance efficiently. In particular, instead of considering a dense coupling matrix, our method leverages a simple but effective sampling strategy to construct a sparse coupling matrix and update it with few computations. The proposed \textsc{Spar-GW} method is applicable to the GW distance with arbitrary ground cost, and it reduces the complexity from $O(n^4)$ to $O(n^{2+\delta})$ for an arbitrary small $\delta>0$. Theoretically, the convergence and consistency of the proposed estimation for GW distance are established under mild regularity conditions. In addition, this method can be extended to approximate the variants of GW distance, including the entropic GW distance, the fused GW distance, and the unbalanced GW distance. Experiments show the superiority of our \textsc{Spar-GW} to state-of-the-art methods in both synthetic and real-world tasks.  ( 2 min )
    Making Decisions under Outcome Performativity. (arXiv:2210.01745v2 [cs.LG] UPDATED)
    Decision-makers often act in response to data-driven predictions, with the goal of achieving favorable outcomes. In such settings, predictions don't passively forecast the future; instead, predictions actively shape the distribution of outcomes they are meant to predict. This performative prediction setting raises new challenges for learning "optimal" decision rules. In particular, existing solution concepts do not address the apparent tension between the goals of forecasting outcomes accurately and steering individuals to achieve desirable outcomes. To contend with this concern, we introduce a new optimality concept -- performative omniprediction -- adapted from the supervised (non-performative) learning setting. A performative omnipredictor is a single predictor that simultaneously encodes the optimal decision rule with respect to many possibly-competing objectives. Our main result demonstrates that efficient performative omnipredictors exist, under a natural restriction of performative prediction, which we call outcome performativity. On a technical level, our results follow by carefully generalizing the notion of outcome indistinguishability to the outcome performative setting. From an appropriate notion of Performative OI, we recover many consequences known to hold in the supervised setting, such as omniprediction and universal adaptability.  ( 2 min )
    EMAHA-DB1: A New Upper Limb sEMG Dataset for Classification of Activities of Daily Living. (arXiv:2301.03325v1 [eess.SP])
    In this paper, we present electromyography analysis of human activity - database 1 (EMAHA-DB1), a novel dataset of multi-channel surface electromyography (sEMG) signals to evaluate the activities of daily living (ADL). The dataset is acquired from 25 able-bodied subjects while performing 22 activities categorised according to functional arm activity behavioral system (FAABOS) (3 - full hand gestures, 6 - open/close office draw, 8 - grasping and holding of small office objects, 2 - flexion and extension of finger movements, 2 - writing and 1 - rest). The sEMG data is measured by a set of five Noraxon Ultium wireless sEMG sensors with Ag/Agcl electrodes placed on a human hand. The dataset is analyzed for hand activity recognition classification performance. The classification is performed using four state-ofthe-art machine learning classifiers, including Random Forest (RF), Fine K-Nearest Neighbour (KNN), Ensemble KNN (sKNN) and Support Vector Machine (SVM) with seven combinations of time domain and frequency domain feature sets. The state-of-theart classification accuracy on five FAABOS categories is 83:21% by using the SVM classifier with the third order polynomial kernel using energy feature and auto regressive feature set ensemble. The classification accuracy on 22 class hand activities is 75:39% by the same SVM classifier with the log moments in frequency domain (LMF) feature, modified LMF, time domain statistical (TDS) feature, spectral band powers (SBP), channel cross correlation and local binary patterns (LBP) set ensemble. The analysis depicts the technical challenges addressed by the dataset. The developed dataset can be used as a benchmark for various classification methods as well as for sEMG signal analysis corresponding to ADL and for the development of prosthetics and other wearable robotics.  ( 2 min )
    A Newton-CG based augmented Lagrangian method for finding a second-order stationary point of nonconvex equality constrained optimization with complexity guarantees. (arXiv:2301.03139v1 [math.OC])
    In this paper we consider finding a second-order stationary point (SOSP) of nonconvex equality constrained optimization when a nearly feasible point is known. In particular, we first propose a new Newton-CG method for finding an approximate SOSP of unconstrained optimization and show that it enjoys a substantially better complexity than the Newton-CG method [56]. We then propose a Newton-CG based augmented Lagrangian (AL) method for finding an approximate SOSP of nonconvex equality constrained optimization, in which the proposed Newton-CG method is used as a subproblem solver. We show that under a generalized linear independence constraint qualification (GLICQ), our AL method enjoys a total inner iteration complexity of $\widetilde{\cal O}(\epsilon^{-7/2})$ and an operation complexity of $\widetilde{\cal O}(\epsilon^{-7/2}\min\{n,\epsilon^{-3/4}\})$ for finding an $(\epsilon,\sqrt{\epsilon})$-SOSP of nonconvex equality constrained optimization with high probability, which are significantly better than the ones achieved by the proximal AL method [60]. Besides, we show that it has a total inner iteration complexity of $\widetilde{\cal O}(\epsilon^{-11/2})$ and an operation complexity of $\widetilde{\cal O}(\epsilon^{-11/2}\min\{n,\epsilon^{-5/4}\})$ when the GLICQ does not hold. To the best of our knowledge, all the complexity results obtained in this paper are new for finding an approximate SOSP of nonconvex equality constrained optimization with high probability. Preliminary numerical results also demonstrate the superiority of our proposed methods over the ones in [56,60].  ( 2 min )
    Asymptotic Bounds for Smoothness Parameter Estimates in Gaussian Process Interpolation. (arXiv:2203.05400v3 [math.ST] UPDATED)
    It is common to model a deterministic response function, such as the output of a computer experiment, as a Gaussian process with a Mat\'ern covariance kernel. The smoothness parameter of a Mat\'ern kernel determines many important properties of the model in the large data limit, including the rate of convergence of the conditional mean to the response function. We prove that the maximum likelihood estimate of the smoothness parameter cannot asymptotically undersmooth the truth when the data are obtained on a fixed bounded subset of $\mathbb{R}^d$. That is, if the data-generating response function has Sobolev smoothness $\nu_0 + d/2$, then the smoothness parameter estimate cannot be asymptotically less than $\nu_0 + d/2$. The lower bound is sharp. Additionally, we show that maximum likelihood estimation finds the "correct" smoothness for a class of compactly supported self-similar functions. We also consider cross-validation and prove an asymptotic lower bound $\nu_0$, which however is unlikely to be sharp. The results are based on approximation theory in Sobolev spaces and some general theorems that restrict the set of values that the parameter estimators can take.  ( 2 min )
    Exploration in Linear Bandits with Rich Action Sets and its Implications for Inference. (arXiv:2207.11597v3 [cs.LG] UPDATED)
    We present a non-asymptotic lower bound on the eigenspectrum of the design matrix generated by any linear bandit algorithm with sub-linear regret when the action set has well-behaved curvature. Specifically, we show that the minimum eigenvalue of the expected design matrix grows as $\Omega(\sqrt{n})$ whenever the expected cumulative regret of the algorithm is $O(\sqrt{n})$, where $n$ is the learning horizon, and the action-space has a constant Hessian around the optimal arm. This shows that such action-spaces force a polynomial lower bound rather than a logarithmic lower bound, as shown by \cite{lattimore2017end}, in discrete (i.e., well-separated) action spaces. Furthermore, while the previous result is shown to hold only in the asymptotic regime (as $n \to \infty$), our result for these "locally rich" action spaces is any-time. Additionally, under a mild technical assumption, we obtain a similar lower bound on the minimum eigen value holding with high probability. We apply our result to two practical scenarios -- \emph{model selection} and \emph{clustering} in linear bandits. For model selection, we show that an epoch-based linear bandit algorithm adapts to the true model complexity at a rate exponential in the number of epochs, by virtue of our novel spectral bound. For clustering, we consider a multi agent framework where we show, by leveraging the spectral result, that no forced exploration is necessary -- the agents can run a linear bandit algorithm and estimate their underlying parameters at once, and hence incur a low regret.  ( 2 min )
    Stochastic Langevin Monte Carlo for (weakly) log-concave posterior distributions. (arXiv:2301.03077v1 [stat.ML])
    In this paper, we investigate a continuous time version of the Stochastic Langevin Monte Carlo method, introduced in [WT11], that incorporates a stochastic sampling step inside the traditional over-damped Langevin diffusion. This method is popular in machine learning for sampling posterior distribution. We will pay specific attention in our work to the computational cost in terms of $n$ (the number of observations that produces the posterior distribution), and $d$ (the dimension of the ambient space where the parameter of interest is living). We derive our analysis in the weakly convex framework, which is parameterized with the help of the Kurdyka-\L ojasiewicz (KL) inequality, that permits to handle a vanishing curvature settings, which is far less restrictive when compared to the simple strongly convex case. We establish that the final horizon of simulation to obtain an $\varepsilon$ approximation (in terms of entropy) is of the order $( d \log(n)^2 )^{(1+r)^2} [\log^2(\varepsilon^{-1}) + n^2 d^{2(1+r)} \log^{4(1+r)}(n) ]$ with a Poissonian subsampling of parameter $\left(n ( d \log^2(n))^{1+r}\right)^{-1}$, where the parameter $r$ is involved in the KL inequality and varies between $0$ (strongly convex case) and $1$ (limiting Laplace situation).  ( 2 min )
    Exploration in Model-based Reinforcement Learning with Randomized Reward. (arXiv:2301.03142v1 [stat.ML])
    Model-based Reinforcement Learning (MBRL) has been widely adapted due to its sample efficiency. However, existing worst-case regret analysis typically requires optimistic planning, which is not realistic in general. In contrast, motivated by the theory, empirical study utilizes ensemble of models, which achieve state-of-the-art performance on various testing environments. Such deviation between theory and empirical study leads us to question whether randomized model ensemble guarantee optimism, and hence the optimal worst-case regret? This paper partially answers such question from the perspective of reward randomization, a scarcely explored direction of exploration with MBRL. We show that under the kernelized linear regulator (KNR) model, reward randomization guarantees a partial optimism, which further yields a near-optimal worst-case regret in terms of the number of interactions. We further extend our theory to generalized function approximation and identified conditions for reward randomization to attain provably efficient exploration. Correspondingly, we propose concrete examples of efficient reward randomization. To the best of our knowledge, our analysis establishes the first worst-case regret analysis on randomized MBRL with function approximation.  ( 2 min )
    Online Centralized Non-parametric Change-point Detection via Graph-based Likelihood-ratio Estimation. (arXiv:2301.03011v1 [stat.ML])
    Consider each node of a graph to be generating a data stream that is synchronized and observed at near real-time. At a change-point $\tau$, a change occurs at a subset of nodes $C$, which affects the probability distribution of their associated node streams. In this paper, we propose a novel kernel-based method to both detect $\tau$ and localize $C$, based on the direct estimation of the likelihood-ratio between the post-change and the pre-change distributions of the node streams. Our main working hypothesis is the smoothness of the likelihood-ratio estimates over the graph, i.e connected nodes are expected to have similar likelihood-ratios. The quality of the proposed method is demonstrated on extensive experiments on synthetic scenarios.  ( 2 min )
    On Consistency and Asymptotic Normality of Least Absolute Deviation Estimators for 2-dimensional Sinusoidal Model. (arXiv:2301.03229v1 [math.ST])
    Estimation of the parameters of a 2-dimensional sinusoidal model is a fundamental problem in digital signal processing. In this paper, we propose a robust least absolute deviation (LAD) estimators for parameter estimation. The proposed methodology provides a robust alternative to non-robust estimation techniques like the least squares estimators, in situations where outliers are present in the data or in the presence of heavy tailed noise. We study important asymptotic properties of the LAD estimators and establish the strong consistency and asymptotic normality of the LAD estimators. We further illustrate the advantage of using LAD estimators over least squares estimators through extensive simulation studies.  ( 2 min )
    The Optimal Input-Independent Baseline for Binary Classification: The Dutch Draw. (arXiv:2301.03318v1 [cs.LG])
    Before any binary classification model is taken into practice, it is important to validate its performance on a proper test set. Without a frame of reference given by a baseline method, it is impossible to determine if a score is `good' or `bad'. The goal of this paper is to examine all baseline methods that are independent of feature values and determine which model is the `best' and why. By identifying which baseline models are optimal, a crucial selection decision in the evaluation process is simplified. We prove that the recently proposed Dutch Draw baseline is the best input-independent classifier (independent of feature values) for all positional-invariant measures (independent of sequence order) assuming that the samples are randomly shuffled. This means that the Dutch Draw baseline is the optimal baseline under these intuitive requirements and should therefore be used in practice.  ( 2 min )
    A Sublinear-Time Quantum Algorithm for Approximating Partition Functions. (arXiv:2207.08643v2 [quant-ph] UPDATED)
    We present a novel quantum algorithm for estimating Gibbs partition functions in sublinear time with respect to the logarithm of the size of the state space. This is the first speed-up of this type to be obtained over the seminal nearly-linear time algorithm of \v{S}tefankovi\v{c}, Vempala and Vigoda [JACM, 2009]. Our result also preserves the quadratic speed-up in precision and spectral gap achieved in previous work by exploiting the properties of quantum Markov chains. As an application, we obtain new polynomial improvements over the best-known algorithms for computing the partition function of the Ising model, counting the number of $k$-colorings, matchings or independent sets of a graph, and estimating the volume of a convex body. Our approach relies on developing new variants of the quantum phase and amplitude estimation algorithms that return nearly unbiased estimates with low variance and without destroying their initial quantum state. We extend these subroutines into a nearly unbiased quantum mean estimator that reduces the variance quadratically faster than the classical empirical mean. No such estimator was known to exist prior to our work. These properties, which are of general interest, lead to better convergence guarantees within the paradigm of simulated annealing for computing partition functions.  ( 2 min )
    Subset verification and search algorithms for causal DAGs. (arXiv:2301.03180v1 [cs.LG])
    Learning causal relationships between variables is a fundamental task in causal inference and directed acyclic graphs (DAGs) are a popular choice to represent the causal relationships. As one can recover a causal graph only up to its Markov equivalence class from observations, interventions are often used for the recovery task. Interventions are costly in general and it is important to design algorithms that minimize the number of interventions performed. In this work, we study the problem of learning the causal relationships of a subset of edges (target edges) in a graph with as few interventions as possible. Under the assumptions of faithfulness, causal sufficiency, and ideal interventions, we study this problem in two settings: when the underlying ground truth causal graph is known (subset verification) and when it is unknown (subset search). For the subset verification problem, we provide an efficient algorithm to compute a minimum sized interventional set; we further extend these results to bounded size non-atomic interventions and node-dependent interventional costs. For the subset search problem, in the worst case, we show that no algorithm (even with adaptivity or randomization) can achieve an approximation ratio that is asymptotically better than the vertex cover of the target edges when compared with the subset verification number. This result is surprising as there exists a logarithmic approximation algorithm for the search problem when we wish to recover the whole causal graph. To obtain our results, we prove several interesting structural properties of interventional causal graphs that we believe have applications beyond the subset verification/search problems studied here.  ( 2 min )
    Batch Bayesian Optimization via Particle Gradient Flows. (arXiv:2209.04722v2 [stat.ML] UPDATED)
    Bayesian Optimisation (BO) methods seek to find global optima of objective functions which are only available as a black-box or are expensive to evaluate. Such methods construct a surrogate model for the objective function, quantifying the uncertainty in that surrogate through Bayesian inference. Objective evaluations are sequentially determined by maximising an acquisition function at each step. However, this ancilliary optimisation problem can be highly non-trivial to solve, due to the non-convexity of the acquisition function, particularly in the case of batch Bayesian optimisation, where multiple points are selected in every step. In this work we reformulate batch BO as an optimisation problem over the space of probability measures. We construct a new acquisition function based on multipoint expected improvement which is convex over the space of probability measures. Practical schemes for solving this `inner' optimisation problem arise naturally as gradient flows of this objective function. We demonstrate the efficacy of this new method on different benchmark functions and compare with state-of-the-art batch BO methods.
    Accelerated Randomized Block-Coordinate Algorithms for Co-coercive Equations and Applications. (arXiv:2301.03113v1 [math.OC])
    In this paper, we develop an accelerated randomized block-coordinate algorithm to approximate a solution of a co-coercive equation. Such an equation plays a central role in optimization and related fields and covers many mathematical models as special cases, including convex optimization, convex-concave minimax, and variational inequality problems. Our algorithm relies on a recent Nesterov's accelerated interpretation of the Halpern fixed-point iteration in [48]. We establish that the new algorithm achieves $\mathcal{O}(1/k^2)$-convergence rate on $\mathbb{E}[\Vert Gx^k\Vert^2]$ through the last-iterate, where $G$ is the underlying co-coercive operator, $\mathbb{E}[\cdot]$ is the expectation, and $k$ is the iteration counter. This rate is significantly faster than $\mathcal{O}(1/k)$ rates in standard forward or gradient-based methods from the literature. We also prove $o(1/k^2)$ rates on both $\mathbb{E}[\Vert Gx^k\Vert^2]$ and $\mathbb{E}[\Vert x^{k+1} - x^{k}\Vert^2]$. Next, we apply our method to derive two accelerated randomized block coordinate variants of the forward-backward splitting and Douglas-Rachford splitting schemes, respectively for solving a monotone inclusion involving the sum of two operators. As a byproduct, these variants also have faster convergence rates than their non-accelerated counterparts. Finally, we apply our scheme to a finite-sum monotone inclusion that has various applications in machine learning and statistical learning, including federated learning. As a result, we obtain a novel federated learning-type algorithm with fast and provable convergence rates.  ( 2 min )
    Provably Efficient Model-Free Constrained RL with Linear Function Approximation. (arXiv:2206.11889v3 [cs.LG] UPDATED)
    We study the constrained reinforcement learning problem, in which an agent aims to maximize the expected cumulative reward subject to a constraint on the expected total value of a utility function. In contrast to existing model-based approaches or model-free methods accompanied with a `simulator', we aim to develop the first model-free, simulator-free algorithm that achieves a sublinear regret and a sublinear constraint violation even in large-scale systems. To this end, we consider the episodic constrained Markov decision processes with linear function approximation, where the transition dynamics and the reward function can be represented as a linear function of some known feature mapping. We show that $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret and $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ constraint violation bounds can be achieved, where $d$ is the dimension of the feature mapping, $H$ is the length of the episode, and $T$ is the total number of steps. Our bounds are attained without explicitly estimating the unknown transition model or requiring a simulator, and they depend on the state space only through the dimension of the feature mapping. Hence our bounds hold even when the number of states goes to infinity. Our main results are achieved via novel adaptations of the standard LSVI-UCB algorithms. In particular, we first introduce primal-dual optimization into the LSVI-UCB algorithm to balance between regret and constraint violation. More importantly, we replace the standard greedy selection with respect to the state-action function in LSVI-UCB with a soft-max policy. This turns out to be key in establishing uniform concentration for the constrained case via its approximation-smoothness trade-off. We also show that one can achieve an even zero constraint violation while still maintaining the same order with respect to $T$.  ( 3 min )
    Mesoscopic modeling of hidden spiking neurons. (arXiv:2205.13493v2 [q-bio.NC] UPDATED)
    Can we use spiking neural networks (SNN) as generative models of multi-neuronal recordings, while taking into account that most neurons are unobserved? Modeling the unobserved neurons with large pools of hidden spiking neurons leads to severely underconstrained problems that are hard to tackle with maximum likelihood estimation. In this work, we use coarse-graining and mean-field approximations to derive a bottom-up, neuronally-grounded latent variable model (neuLVM), where the activity of the unobserved neurons is reduced to a low-dimensional mesoscopic description. In contrast to previous latent variable models, neuLVM can be explicitly mapped to a recurrent, multi-population SNN, giving it a transparent biological interpretation. We show, on synthetic spike trains, that a few observed neurons are sufficient for neuLVM to perform efficient model inversion of large SNNs, in the sense that it can recover connectivity parameters, infer single-trial latent population activity, reproduce ongoing metastable dynamics, and generalize when subjected to perturbations mimicking photo-stimulation.  ( 2 min )
    A Classification of $G$-invariant Shallow Neural Networks. (arXiv:2205.09219v5 [cs.LG] UPDATED)
    When trying to fit a deep neural network (DNN) to a $G$-invariant target function with $G$ a group, it only makes sense to constrain the DNN to be $G$-invariant as well. However, there can be many different ways to do this, thus raising the problem of ``$G$-invariant neural architecture design'': What is the optimal $G$-invariant architecture for a given problem? Before we can consider the optimization problem itself, we must understand the search space, the architectures in it, and how they relate to one another. In this paper, we take a first step towards this goal; we prove a theorem that gives a classification of all $G$-invariant single-hidden-layer or ``shallow'' neural network ($G$-SNN) architectures with ReLU activation for any finite orthogonal group $G$, and we prove a second theorem that characterizes the inclusion maps or ``network morphisms'' between the architectures that can be leveraged during neural architecture search (NAS). The proof is based on a correspondence of every $G$-SNN to a signed permutation representation of $G$ acting on the hidden neurons; the classification is equivalently given in terms of the first cohomology classes of $G$, thus admitting a topological interpretation. The $G$-SNN architectures corresponding to nontrivial cohomology classes have, to our knowledge, never been explicitly identified in the literature previously. Using a code implementation, we enumerate the $G$-SNN architectures for some example groups $G$ and visualize their structure. Finally, we prove that architectures corresponding to inequivalent cohomology classes coincide in function space only when their weight matrices are zero, and we discuss the implications of this for NAS.  ( 3 min )
    Exponential Family Model-Based Reinforcement Learning via Score Matching. (arXiv:2112.14195v2 [cs.LG] UPDATED)
    We propose an optimistic model-based algorithm, dubbed SMRL, for finite-horizon episodic reinforcement learning (RL) when the transition model is specified by exponential family distributions with $d$ parameters and the reward is bounded and known. SMRL uses score matching, an unnormalized density estimation technique that enables efficient estimation of the model parameter by ridge regression. Under standard regularity assumptions, SMRL achieves $\tilde O(d\sqrt{H^3T})$ online regret, where $H$ is the length of each episode and $T$ is the total number of interactions (ignoring polynomial dependence on structural scale parameters).  ( 2 min )
    Meta-Analysis of Randomized Experiments with Applications to Heavy-Tailed Response Data. (arXiv:2112.07602v5 [stat.ME] UPDATED)
    A central obstacle in the objective assessment of treatment effect (TE) estimators in randomized control trials (RCTs) is the lack of ground truth (or validation set) to test their performance. In this paper, we propose a novel cross-validation-like methodology to address this challenge. The key insight of our procedure is that the noisy (but unbiased) difference-of-means estimate can be used as a ground truth ``label" on a portion of the RCT, to test the performance of an estimator trained on the other portion. We combine this insight with an aggregation scheme, which borrows statistical strength across a large collection of RCTs, to present an end-to-end methodology for judging an estimator's ability to recover the underlying treatment effect as well as produce an optimal treatment "roll out" policy. We evaluate our methodology across 699 RCTs implemented in the Amazon supply chain. In this heavy-tailed setting, our methodology suggests that procedures that aggressively downweight or truncate large values, while introducing bias, lower the variance enough to ensure that the treatment effect is more accurately estimated.  ( 2 min )
    Wasserstein Iterative Networks for Barycenter Estimation. (arXiv:2201.12245v2 [cs.LG] UPDATED)
    Wasserstein barycenters have become popular due to their ability to represent the average of probability measures in a geometrically meaningful way. In this paper, we present an algorithm to approximate the Wasserstein-2 barycenters of continuous measures via a generative model. Previous approaches rely on regularization (entropic/quadratic) which introduces bias or on input convex neural networks which are not expressive enough for large-scale tasks. In contrast, our algorithm does not introduce bias and allows using arbitrary neural networks. In addition, based on the celebrity faces dataset, we construct Ave, celeba! dataset which can be used for quantitative evaluation of barycenter algorithms by using standard metrics of generative models such as FID.  ( 2 min )
    Convergence of Stochastic Approximation via Martingale and Converse Lyapunov Methods. (arXiv:2205.01303v3 [stat.ML] UPDATED)
    In this paper, we study the almost sure boundedness and the convergence of the stochastic approximation (SA) algorithm. At present, most available convergence proofs are based on the ODE method, and the almost sure boundedness of the iterations is an assumption and not a conclusion. In Borkar-Meyn (2000), it is shown that if the ODE has only one globally attractive equilibrium, then under additional assumptions, the iterations are bounded almost surely, and the SA algorithm converges to the desired solution. Our objective in the present paper is to provide an alternate proof of the above, based on martingale methods, which are simpler and less technical than those based on the ODE method. As a prelude, we prove a new sufficient condition for the global asymptotic stability of an ODE. Next we prove a "converse" Lyapunov theorem on the existence of a suitable Lyapunov function with a globally bounded Hessian, for a globally exponentially stable system. Both theorems are of independent interest to researchers in stability theory. Then, using these results, we provide sufficient conditions for the almost sure boundedness and the convergence of the SA algorithm. We show through examples that our theory covers some situations that are not covered by currently known results, specifically Borkar-Meyn (2000).  ( 2 min )
    Optimization-based Causal Estimation from Heterogenous Environments. (arXiv:2109.11990v2 [stat.ME] UPDATED)
    This paper presents a new optimization approach to causal estimation. Given data that contains covariates and an outcome, which covariates are causes of the outcome, and what is the strength of the causality? In classical machine learning (ML), the goal of optimization is to maximize predictive accuracy. However, some covariates might exhibit a non-causal association to the outcome. Such spurious associations provide predictive power for classical ML, but they prevent us from causally interpreting the result. This paper proposes CoCo, an optimization algorithm that bridges the gap between pure prediction and causal inference. CoCo leverages the recently-proposed idea of environments, datasets of covariates/response where the causal relationships remain invariant but where the distribution of the covariates changes from environment to environment. Given datasets from multiple environments -- and ones that exhibit sufficient heterogeneity -- CoCo maximizes an objective for which the only solution is the causal solution. We describe the theoretical foundations of this approach and demonstrate its effectiveness on simulated and real datasets. Compared to classical ML and existing methods, CoCo provides more accurate estimates of the causal model.  ( 2 min )
    Computationally Efficient Approximations for Matrix-based Renyi's Entropy. (arXiv:2112.13720v4 [stat.ML] UPDATED)
    The recently developed matrix based Renyi's entropy enables measurement of information in data simply using the eigenspectrum of symmetric positive semi definite (PSD) matrices in reproducing kernel Hilbert space, without estimation of the underlying data distribution. This intriguing property makes the new information measurement widely adopted in multiple statistical inference and learning tasks. However, the computation of such quantity involves the trace operator on a PSD matrix $G$ to power $\alpha$(i.e., $tr(G^\alpha)$), with a normal complexity of nearly $O(n^3)$, which severely hampers its practical usage when the number of samples (i.e., $n$) is large. In this work, we present computationally efficient approximations to this new entropy functional that can reduce its complexity to even significantly less than $O(n^2)$. To this end, we leverage the recent progress on Randomized Numerical Linear Algebra, developing Taylor, Chebyshev and Lanczos approximations to $tr(G^\alpha)$ for arbitrary values of $\alpha$ by converting it into matrix-vector multiplications problem. We also establish the connection between the matrix-based Renyi's entropy and PSD matrix approximation, which enables exploiting both clustering and block low-rank structure of $G$ to further reduce the computational cost. We theoretically provide approximation accuracy guarantees and illustrate the properties of different approximations. Large-scale experimental evaluations on both synthetic and real-world data corroborate our theoretical findings, showing promising speedup with negligible loss in accuracy.  ( 2 min )
    Reinforcement Learning for Joint Optimization of Multiple Rewards. (arXiv:1909.02940v4 [cs.LG] UPDATED)
    Finding optimal policies which maximize long term rewards of Markov Decision Processes requires the use of dynamic programming and backward induction to solve the Bellman optimality equation. However, many real-world problems require optimization of an objective that is non-linear in cumulative rewards for which dynamic programming cannot be applied directly. For example, in a resource allocation problem, one of the objectives is to maximize long-term fairness among the users. We notice that when an agent aim to optimize some function of the sum of rewards is considered, the problem loses its Markov nature. This paper addresses and formalizes the problem of optimizing a non-linear function of the long term average of rewards. We propose model-based and model-free algorithms to learn the policy, where the model-based policy is shown to achieve a regret of $\Tilde{O}\left(LKDS\sqrt{\frac{A}{T}}\right)$ for $K$ objectives combined with a concave $L$-Lipschitz function. Further, using the fairness in cellular base-station scheduling, and queueing system scheduling as examples, the proposed algorithm is shown to significantly outperform the conventional RL approaches.  ( 2 min )
    PatchUp: A Feature-Space Block-Level Regularization Technique for Convolutional Neural Networks. (arXiv:2006.07794v2 [cs.LG] UPDATED)
    Large capacity deep learning models are often prone to a high generalization gap when trained with a limited amount of labeled training data. A recent class of methods to address this problem uses various ways to construct a new training sample by mixing a pair (or more) of training samples. We propose PatchUp, a hidden state block-level regularization technique for Convolutional Neural Networks (CNNs), that is applied on selected contiguous blocks of feature maps from a random pair of samples. Our approach improves the robustness of CNN models against the manifold intrusion problem that may occur in other state-of-the-art mixing approaches. Moreover, since we are mixing the contiguous block of features in the hidden space, which has more dimensions than the input space, we obtain more diverse samples for training towards different dimensions. Our experiments on CIFAR10/100, SVHN, Tiny-ImageNet, and ImageNet using ResNet architectures including PreActResnet18/34, WRN-28-10, ResNet101/152 models show that PatchUp improves upon, or equals, the performance of current state-of-the-art regularizers for CNNs. We also show that PatchUp can provide a better generalization to deformed samples and is more robust against adversarial attacks.  ( 2 min )
    Differentially private inference via noisy optimization. (arXiv:2103.11003v3 [math.ST] UPDATED)
    We propose a general optimization-based framework for computing differentially private M-estimators and a new method for constructing differentially private confidence regions. Firstly, we show that robust statistics can be used in conjunction with noisy gradient descent or noisy Newton methods in order to obtain optimal private estimators with global linear or quadratic convergence, respectively. We establish local and global convergence guarantees, under both local strong convexity and self-concordance, showing that our private estimators converge with high probability to a nearly optimal neighborhood of the non-private M-estimators. Secondly, we tackle the problem of parametric inference by constructing differentially private estimators of the asymptotic variance of our private M-estimators. This naturally leads to approximate pivotal statistics for constructing confidence regions and conducting hypothesis testing. We demonstrate the effectiveness of a bias correction that leads to enhanced small-sample empirical performance in simulations. We illustrate the benefits of our methods in several numerical examples.  ( 2 min )
    Simple Binary Hypothesis Testing under Local Differential Privacy and Communication Constraints. (arXiv:2301.03566v1 [math.ST])
    We study simple binary hypothesis testing under both local differential privacy (LDP) and communication constraints. We qualify our results as either minimax optimal or instance optimal: the former hold for the set of distribution pairs with prescribed Hellinger divergence and total variation distance, whereas the latter hold for specific distribution pairs. For the sample complexity of simple hypothesis testing under pure LDP constraints, we establish instance-optimal bounds for distributions with binary support; minimax-optimal bounds for general distributions; and (approximately) instance-optimal, computationally efficient algorithms for general distributions. When both privacy and communication constraints are present, we develop instance-optimal, computationally efficient algorithms that achieve the minimum possible sample complexity (up to universal constants). Our results on instance-optimal algorithms hinge on identifying the extreme points of the joint range set $\mathcal A$ of two distributions $p$ and $q$, defined as $\mathcal A := \{(\mathbf T p, \mathbf T q) | \mathbf T \in \mathcal C\}$, where $\mathcal C$ is the set of channels characterizing the constraints.  ( 2 min )
    Concentration of measure and generalized product of random vectors with an application to Hanson-Wright-like inequalities. (arXiv:2102.08020v3 [math.PR] UPDATED)
    Starting from concentration of measure hypotheses on $m$ random vectors $Z_1,\ldots, Z_m$, this article provides an expression of the concentration of functionals $\phi(Z_1,\ldots, Z_m)$ where the variations of $\phi$ on each variable depend on the product of the norms (or semi-norms) of the other variables (as if $\phi$ were a product). We illustrate the importance of this result through various generalizations of the Hanson-Wright concentration inequality as well as through a study of the random matrix $XDX^T$ and its resolvent $Q = (I_p - \frac{1}{n}XDX^T)^{-1}$, where $X$ and $D$ are random, which have fundamental interest in statistical machine learning applications.  ( 2 min )
    Spectral properties of sample covariance matrices arising from random matrices with independent non identically distributed columns. (arXiv:2109.02644v3 [math.PR] UPDATED)
    Given a random matrix $X= (x_1,\ldots, x_n)\in \mathcal M_{p,n}$ with independent columns and satisfying concentration of measure hypotheses and a parameter $z$ whose distance to the spectrum of $\frac{1}{n} XX^T$ should not depend on $p,n$, it was previously shown that the functionals $\text{tr}(AR(z))$, for $R(z) = (\frac{1}{n}XX^T- zI_p)^{-1}$ and $A\in \mathcal M_{p}$ deterministic, have a standard deviation of order $O(\|A\|_* / \sqrt n)$. Here, we show that $\|\mathbb E[R(z)] - \tilde R(z)\|_F \leq O(1/\sqrt n)$, where $\tilde R(z)$ is a deterministic matrix depending only on $z$ and on the means and covariances of the column vectors $x_1,\ldots, x_n$ (that do not have to be identically distributed). This estimation is key to providing accurate fluctuation rates of functionals of $X$ of interest (mostly related to its spectral properties) and is proved thanks to the introduction of a semi-metric $d_s$ defined on the set $\mathcal D_n(\mathbb H)$ of diagonal matrices with complex entries and positive imaginary part and satisfying, for all $D,D' \in \mathcal D_n(\mathbb H)$: $d_s(D,D') = \max_{i\in[n]} |D_i - D_i'|/ (\Im(D_i) \Im(D_i'))^{1/2}$. Possibly most importantly, the underlying concentration of measure assumption on the columns of $X$ finds an extremely natural ground for application in modern statistical machine learning algorithms where non-linear Lipschitz mappings and high number of classes form the base ingredients.  ( 2 min )

  • Open

    [D] Form on sharing ML codes
    Hello everyone, I would kindly ask you if you could help me get some insights about people’s preferences when sharing ML codes, with special focus on neural networks. I am here linking a very quick Google Form. Please, feel free to reach out. https://forms.gle/4zg5HLqLaEESuVTz9 submitted by /u/Fc3692 [link] [comments]  ( 57 min )
    [D] Soft Prompt Training Issue
    I am implementing soft prompt tuning (reproducing https://arxiv.org/abs/2104.08691v2) for my research project, but the training makes the model predict "False" only in T/F classification task (BoolQ dataset). I have tried all other code on full model fine-tuning and it's working to exclude all other issues (so it's unrelated to dataset and trainer). ​ There are some observations to exclude possible issues. The soft prompt parameters do change during soft prompt training. (gradient backprop on soft prompt is working) Training loss goes down normally just as model fine-tuning Any idea on how to debug the issue. submitted by /u/SEAIndigenous [link] [comments]  ( 57 min )
    [N] Microsoft Considers $10 Billion Investment in ChatGPT Creator --Bloomberg News
    Story here: https://www.bloomberg.com/news/articles/2023-01-10/microsoft-weighs-10-billion-chatgpt-investment-semafor-says?srnd=premium Unpaywalled: https://archive.ph/XOOlg submitted by /u/bikeskata [link] [comments]  ( 61 min )
    [D] Found very similar paper to my submitted paper on Arxiv
    If the mods want to ban this because it falls outside of meaningful discussion, that’s ok. I have a paper in the review process for CVPR atm. A couple of hours ago I stumbled upon an Arxiv paper upload 2 days ago that replicates my method almost exactly save for a few differences in how the inputs are processed, and how the problem is defined (super super similar problems). Their paper achieves far better results than mine, is tested on more datasets than mine, and comes from a big well known research group in my field to boot. I guess I feel a bit dejected? The approach was truly novel and nobody had done it before. Even with my limited training, was showing very promising results. I couldn’t train for longer to improve the model further due to a lack of hardware/budget, and I couldn’t test on more datasets for the same reason. It’ll probably get rejected from cvpr for those very reasons. I’m not complaining about this, it was my decision to submit there and take the chance, but damn. In hindsight I should’ve maybe gone for an easier journal or something and at least be guaranteed to be the first. 😔 Sorry if that was a bit of a rant, I just figure people here can relate a bit. submitted by /u/TightestKnees [link] [comments]  ( 64 min )
    [P] Evaluating several topic modeling implementations. What's the current best practice? BERTopic? OpenAI Ada-002?
    I have a set of ~100 topic categories, and I want to determine which are semantically close to a text input. I've found several implementations, but I know some (LDA) are already obsolete. OpenAI's text-embedding-ada-002 model just came out so I'm wondering if that's the best option now. Other topic modeling implementations: Multi-Class Text Classification with Doc2Vec & Logistic Regression Build taxonomy-based contextual targeting using AWS Media Intelligence and Hugging Face BERT Topic Modeling with BERTopic submitted by /u/gravenbirdman [link] [comments]  ( 60 min )
    [R] Class-Continuous Conditional Generative Nerual Radiance Field
    Paper: https://arxiv.org/abs/2301.00950 Project Page: https://tom919654.github.io/C3G_NeRF/ (Videos included) Code: https://github.com/tom919654/C3G-NeRF ​ Abstract: The 3D-aware image synthesis focuses on conserving spatial consistency besides generating high-resolution images with fine details. Recently,Neural Radiance Field (NeRF) has been introduced for synthesizing novel views with low computational cost and superior performance. While several works investigate a generative NeRF and show remarkable achievement, they cannot handle conditional and continuous feature manipulation in the generation procedure. In this work, we introduce a novel model, called ClassContinuous Conditional Generative NeRF (C3GNeRF), which can synthesize conditionally manipulated photorealistic 3D-consistent images by projecting conditional features to the generator and the discriminator. The proposed C3GNeRF is evaluated with three image datasets, AFHQ, CelebA, and Cars. As a result, our model shows strong 3D-consistency with fine details and smooth interpolation in conditional feature manipulation. For instance, C3G-NeRF exhibits a Frechet Inception Distance (FID) of 7.64 in 3D- aware face image synthesis with a 1282 resolution. Additionally, we provide FIDs of generated 3D-aware images of each class of the datasets as it is possible to synthesize class-conditional images with C3G-NeRF. ​ Results: https://preview.redd.it/9y8b316dx4ba1.png?width=1750&format=png&auto=webp&s=8470793ff3974932058e1bd196ba7a857375f22f ​ https://preview.redd.it/0icxw8blx4ba1.png?width=1595&format=png&auto=webp&s=4a7c7226d7a9c5a2b28492aca8157c7037b4a647 https://reddit.com/link/107z8z1/video/ka6cqf1gx4ba1/player ​ https://reddit.com/link/107z8z1/video/2lm0bamix4ba1/player submitted by /u/JiwookKim [link] [comments]  ( 60 min )
    [D] Sample Average Approximation-Samples to be used
    Considering a Machine Learning scenario with some pre-available training samples S. In the objective function, let's suppose we have expectation over some reference distribution P_0 whose parameter (example, mean) has been approximated based on training samples S. When performing Sample Average Approximation for that expectation, is it necessary that we sample from the distribution of interest P_0 or can we directly use the training samples that we have? Could you please help me understand this? submitted by /u/RecentUnicorn [link] [comments]  ( 57 min )
  • Open

    "Comments on the Origin and Application of Markov Decision Processes", Howard 2002 (optimizing Sears Catalogue mailings ~1959 with value iteration & inventing policy iteration)
    submitted by /u/gwern [link] [comments]  ( 59 min )
    Episode Q0 is decreasing while cumulative reward increases (and doesn't converge to an optimal policy)
    I am using Matlab Simulink/Simscape to train a two-wheeled balancing robot to balance using a DDPG agent. I've tried tuning hyperparameters like learning rate, discount factor, and mini batch size to no avail. I've tweaked my reward function many times, and I feel like it's alright. For some reason my episode Q0 decreases when my reward improves. I believe this indicates that the critic and actor are disagreeing. Right? Does anybody have suggestions? If need be, I can include my script. Training Curve (orange is the episode Q0 and blue is the episode cumulative reward) submitted by /u/FenderBender43 [link] [comments]  ( 24 min )
    Let’s learn how to use Unity ML-Agents and train a bear 🐻 to shoot snowballs (Deep Reinforcement Learning Free Course by Hugging Face 🤗)
    Hey there! I’m happy to announce that we just published the fifth Unit of the Deep Reinforcement Learning Course 🥳 In this Unit, we’ll learn to use the Unity ML-Agents library by training two agents: The first one will learn to shoot snowballs at the spawning target. The second need to press a button to spawn a pyramid, then navigate to the pyramid, knock it over, and move to the gold brick at the top. To do that, it will need to explore its environment, and we will use a technique called curiosity. Then, after training, you’ll push the trained agents to the Hugging Face Hub, and you’ll be able to visualize it playing directly on your browser without having to use the Unity Editor Start Learning now 👉 https://huggingface.co/deep-rl-course/unit5/introduction https://preview.redd.it/l1nhb7vz59ba1.png?width=1920&format=png&auto=webp&s=3720dc7a454bdf6e4736a4ec3a4914647a06b564 If you want to start studying Deep Reinforcement Learning. We launched this course, and you’re right on time: 2023 is the perfect year to start. We wrote an introduction unit to help you get started. You can start learning now 👉 https://huggingface.co/deep-rl-course/unit0/introduction If you have questions or feedback I would love to answer them. submitted by /u/cranthir_ [link] [comments]  ( 58 min )
    Gymnasium-Robotics 1.2.0 - which now includes maintained versions of the environments from D4RL and the MaMuJoCo environments - is now live
    submitted by /u/jkterry1 [link] [comments]  ( 57 min )
    Deep RL with Mujoco environments using Docker on Apple Silicon
    Hi everyone, I am relatively new to the apple ecosystem, their way of doing things and docker in general. I want to get started with using mujoco environments from within a docker image. ​ The reason for the same being: I do not want to pollute my path and OS with multiple mujoco and deep RL library's dependencies. Secondly, I would like to have a reproducible image that can be used cross platform. ​ I generally debug on my personal machine, and do the actual training on university provided machines which utilise Nvidia hardware. Is there a way to accomplish all my goals? If not, can I at least successfully complete goal 1? submitted by /u/adeecc [link] [comments]  ( 65 min )
    Reconstruction loss for VAE model for skill learning
    I was reading this paper ASPiRe: Adaptive Skill Priors for Reinforcement Learning When looking at their code to train the VAE and priors for skills, I noticed for the reconstruction loss, instead of using classical mean squared error, they use: loss = -Normal(loc=actions_hat, scale=1).logprob(actions) It can be seen in line 179 of this file. I had never seen anyone use this reconstruction loss. Is there a good reason to use this loss? any empirical support? I'd appreciate any help to address this question. Edit*: I just learnt this is the negative likelihood loss. My question still remains. Why is this preferable to MSE? submitted by /u/carlml [link] [comments]  ( 62 min )
  • Open

    A small reminder to appreciate the fundamentals!
    submitted by /u/Imagine-your-success [link] [comments]  ( 47 min )
    Any good ones for real video to animated?
    Hi, Looking for a solution to turn a real video into an animated video through AI or any automation process. I've some camera shy people that want to make a video. Animating it from scratch would make the price go 10x so I'm looking for a solution in the AI space. If you know, please drop a comment. submitted by /u/V-Sec [link] [comments]  ( 48 min )
    AI is not alien, it's us
    submitted by /u/pbw [link] [comments]  ( 46 min )
    AI stack for 2023 - any tools missing to work with this year?
    Thought this was a cool graphic - pulled from https://buildspace.so/notes/ai-stack-2023 (free resource) https://preview.redd.it/uxp5tbcmp9ba1.png?width=456&format=png&auto=webp&s=e31e1f81b7250ddafa5359579e00b4f600def00b submitted by /u/bruclinbrocoli [link] [comments]  ( 49 min )
    in a new article on Hackernoon, i write about how copy and paste can be said to be a forerunner for the digital revolution of the AI text generator. kindly read it here: https://hackernoon.com/from-copy-and-paste-to-ai-text-generator-a-revolution-of-the-digital-age
    submitted by /u/Techoyy [link] [comments]  ( 47 min )
    Is law future proof career ?
    As a newly graduated lawyer, I have been experiencing anxiety about the future of my profession. With the rapid advancements in technology, I can't help but wonder if the legal field will become foreseeable future and if my four years of law school were all for vain , my fears increased after release of GPT-3 , should i think about career change on software dev/Web dev to adapt with changing reality ? submitted by /u/No_Car5573 [link] [comments]  ( 52 min )
    Nerf Technology with Stable Diffusion
    submitted by /u/oridnary_artist [link] [comments]  ( 48 min )
    German Drugmaker BioNTech To Buy AI Startup InstaDeep In $680M Deal
    submitted by /u/The-Techie [link] [comments]  ( 47 min )
    A good free text generator
    Hey, im looking for an actual free text generator that i can write simple articles of 100-300 words with. Most of these "free" sites offer a certain amount of credits etc. Preferably in multiple languages. Anyone have any ideas? submitted by /u/HutsKoning69 [link] [comments]  ( 48 min )
    What is the best paper on AI that you have read in 2022 and why?
    submitted by /u/tiensss [link] [comments]  ( 48 min )
    Some Ultra-Modern Generative Ai
    submitted by /u/Imagine-your-success [link] [comments]  ( 54 min )
    Bringing Extinct Dinosaurs Back To Life Using AI
    submitted by /u/liquidocelotYT [link] [comments]  ( 47 min )
    How You can Apply AI Logics for Scaling Your Business in 2023?
    submitted by /u/yudiz [link] [comments]  ( 48 min )
    Microsoft's AI Tool VALL-E can imitate anyone's voice with just a three-second sample
    submitted by /u/qptbook [link] [comments]  ( 47 min )
    Microsoft Will Likely Invest $10 billion for 49 Percent Stake in OpenAI
    submitted by /u/BackgroundResult [link] [comments]  ( 62 min )
    Scanning text for sensitive/controversial material
    Is there any way to use AI to process text like that? I heard that HR companies have similar processes to go trough social media profiles and look at “red flags”, but is there any engine available to do that on any piece of text? Sorry if it seems like an obvious question, I’m not too well versed in the topic and google is no help as usual. submitted by /u/Big_Razzmatazz_9251 [link] [comments]  ( 52 min )
    Microsoft to Own 49% of OpenAI Once $10B Deal Closes
    submitted by /u/lambolifeofficial [link] [comments]  ( 45 min )
    ChatGPT - code and talks
    submitted by /u/adrp23 [link] [comments]  ( 47 min )
    AI-powered "robot" lawyer will be first of its kind to represent defendant in court
    submitted by /u/Itchy0101 [link] [comments]  ( 49 min )
    chatGPT knowingly withholds information, reveals it slowly upon nudging
    submitted by /u/reportaman [link] [comments]  ( 56 min )
    Best Language AI models to run locally?
    I was curious as to if there are any Language AI models (like Chat GPT) that you can use locally on your own machine (so you don't run into as many server issues like you do on Chat GPT). If so, which are the best and where can you find them? submitted by /u/CreativePolymath [link] [comments]  ( 46 min )
    artflow ai + waifu2x. Tatar slim girl with black hair +
    submitted by /u/SubjectAd1535 [link] [comments]  ( 47 min )
    How can you use/get access to Chinchilla AI?
    So I've been hearing about the Chinchilla AI model (text ai, like Chat GPT) and how great it is, but I haven't seen anything regarding the model or how to use it, or even if it's available to the public yet or not. Does anyone have any insight into this? Is there any way to use it currently? I've worked with Stable Diffusion through A1111, would there be a model that works with A1111? Thanks in advance! submitted by /u/CreativePolymath [link] [comments]  ( 47 min )
    Weekly China AI News from Jan.2 to Jan.8: AI Dominates 2023 Top Tech Trends; Behind Douyin's Popular AI Anime Effect; Go Master Accused of Cheating with AI-Like Play Style
    submitted by /u/trcytony [link] [comments]  ( 46 min )
  • Open

    DSC Weekly 10 January 2023 – Recession Analysis
    Announcements Recession Analysis Editor’s Note: This week’s editorial was written by a guest contributor. If you would like to submit an article or editorial, please contact the editors below. One question that many are pondering is whether we are entering in a recession, and when it will end. Especially regarding the tech sector. Are there… Read More »DSC Weekly 10 January 2023 – Recession Analysis The post DSC Weekly 10 January 2023 – Recession Analysis appeared first on Data Science Central.  ( 20 min )
    Solution vs. Product-Implications for Agile Development
    A Solution is Not a Product Solution, Deliverable, Product, Work Product and other terms are often used interchangeably to describe output from development initiatives.  However, there are some extremely important conceptual differences between the terms Solution and Product and understanding them should inform and guide your thinking about what you are doing as you go… Read More »Solution vs. Product-Implications for Agile Development The post Solution vs. Product-Implications for Agile Development appeared first on Data Science Central.  ( 22 min )
    Responsible AI by design
    Happy new year! One of the big trends in AI this year: AI is maturing as a domain. We are using AI to address complex problems. That means we will need to be more aware of the potential downsides of AI. I believe that a new trend could manifest: responsible AI by design. a) Responsible… Read More »Responsible AI by design The post Responsible AI by design appeared first on Data Science Central.  ( 19 min )
    Flexible Engagement Model to Hire Full-Stack Developers: A 2023 Guide
    The demand for web and app development is rising as there are significant improvements in the field of technology every year. More and more people are establishing careers in the field of development as a result of advancements in technologies and frameworks. As a result, the data from Statista says that by 2024, there will… Read More »Flexible Engagement Model to Hire Full-Stack Developers: A 2023 Guide The post Flexible Engagement Model to Hire Full-Stack Developers: A 2023 Guide appeared first on Data Science Central.  ( 21 min )
  • Open

    is there any software like AI or image editing I could use for generating an image of a paper with text on it (like someone wrote on it)
    Do you guys know these apps like yandex image translator that generate text out of an image ? I want to basically do the opposite but into images of papers (like someone has written on it) submitted by /u/SnooPineapples7791 [link] [comments]  ( 47 min )
    Nerf Technology with Stable Diffusion
    submitted by /u/oridnary_artist [link] [comments]  ( 51 min )
  • Open

    The Greenest Generation: NVIDIA, Intel and Partners Supercharge AI Computing Efficiency
    AI is at the heart of humanity’s most transformative innovations — from developing COVID vaccines at unprecedented speeds and diagnosing cancer to powering autonomous vehicles and understanding climate change. Virtually every industry will benefit from adopting AI, but the technology has become more resource intensive as neural networks have increased in complexity. To avoid placing Read article >  ( 7 min )
  • Open

    Best practices for load testing Amazon SageMaker real-time inference endpoints
    Amazon SageMaker is a fully managed machine learning (ML) service. With SageMaker, data scientists and developers can quickly and easily build and train ML models, and then directly deploy them into a production-ready hosted environment. It provides an integrated Jupyter authoring notebook instance for easy access to your data sources for exploration and analysis, so […]  ( 12 min )
    Get smarter search results with the Amazon Kendra Intelligent Ranking and OpenSearch plugin
    If you’ve had the opportunity to build a search application for unstructured data (i.e., wiki, informational web sites, self-service help pages, internal documentation, etc.) using open source or commercial-off-the-shelf search engines, then you’re probably familiar with the inherent accuracy challenges involved in getting relevant search results. The intended meaning of both query and document can […]  ( 12 min )
  • Open

    Approximating 1/Γ(x)
    A few days ago a comment that a graph looked like a Maxwell-Boltzman density lead to an approximation of 1/Γ(x), possibly a useful approximation. Approximating Γ(x) is a well-known problem, and for large x the solution is to use Stirling’s approximation or a few more terms from the asymptotic series that Stirling’s approximation is a […] Approximating 1/Γ(x) first appeared on John D. Cook.  ( 6 min )
  • Open

    Covid19 Reproduction Number: Credibility Intervals by Blockwise Proximal Monte Carlo Samplers. (arXiv:2203.09142v2 [cs.LG] UPDATED)
    Monitoring the Covid19 pandemic constitutes a critical societal stake that received considerable research efforts. The intensity of the pandemic on a given territory is efficiently measured by the reproduction number, quantifying the rate of growth of daily new infections. Recently, estimates for the time evolution of the reproduction number were produced using an inverse problem formulation with a nonsmooth functional minimization. While it was designed to be robust to the limited quality of the Covid19 data (outliers, missing counts), the procedure lacks the ability to output credibility interval based estimates. This remains a severe limitation for practical use in actual pandemic monitoring by epidemiologists that the present work aims to overcome by use of Monte Carlo sampling. After interpretation of the nonsmooth functional into a Bayesian framework, several sampling schemes are tailored to adjust the nonsmooth nature of the resulting posterior distribution. The originality of the devised algorithms stems from combining a Langevin Monte Carlo sampling scheme with Proximal operators. Performance of the new algorithms in producing relevant credibility intervals for the reproduction number estimates and denoised counts are compared. Assessment is conducted on real daily new infection counts made available by the Johns Hopkins University. The interest of the devised monitoring tools are illustrated on Covid19 data from several different countries.
    Reliable Time Prediction in the Markov Stochastic Block Model. (arXiv:2004.04402v3 [cs.SI] UPDATED)
    We introduce the Markov Stochastic Block Model (MSBM): a growth model for community based networks where node attributes are assigned through a Markovian dynamic. We rely on HMMs' literature to design prediction methods that are robust to local clustering errors. We focus specifically on the link prediction and collaborative filtering problems and we introduce a new model selection procedure to infer the number of hidden clusters in the network. Our approaches for reliable prediction in MSBMs are not algorithm-dependent in the sense that they can be applied using your favourite clustering tool. In this paper, we use a recent SDP method to infer the hidden communities and we provide theoretical guarantees. In particular, we identify the relevant signal-to-noise ratio (SNR) in our framework and we prove that the misclassification error decays exponentially fast with respect to this SNR.  ( 2 min )
    Signal Enhancement for Magnetic Navigation Challenge Problem. (arXiv:2007.12158v2 [cs.LG] UPDATED)
    Harnessing the magnetic field of the Earth for navigation has shown promise as a viable alternative to other navigation systems. A magnetic navigation system collects its own magnetic field data using a magnetometer and uses magnetic anomaly maps to determine the current location. The greatest challenge with magnetic navigation arises when the magnetic field measurements from the magnetometer encompass the magnetic field from not just the Earth, but also from the vehicle on which it is mounted. It is difficult to separate the Earth magnetic anomaly field, which is crucial for navigation, from the total magnetic field reading from the sensor. The purpose of this challenge problem is to decouple the Earth and aircraft magnetic signals in order to derive a clean signal from which to perform magnetic navigation. Baseline testing on the dataset has shown that the Earth magnetic field can be extracted from the total magnetic field using machine learning (ML). The challenge is to remove the aircraft magnetic field from the total magnetic field using a trained model. This challenge offers an opportunity to construct an effective model for removing the aircraft magnetic field from the dataset by using a scientific machine learning (SciML) approach comprised of an ML algorithm integrated with the physics of magnetic navigation.  ( 2 min )
    Ranking Inferences Based on the Top Choice of Multiway Comparisons. (arXiv:2211.11957v3 [stat.ME] UPDATED)
    This paper considers ranking inference of $n$ items based on the observed data on the top choice among $M$ randomly selected items at each trial. This is a useful modification of the Plackett-Luce model for $M$-way ranking with only the top choice observed and is an extension of the celebrated Bradley-Terry-Luce model that corresponds to $M=2$. Under a uniform sampling scheme in which any $M$ distinguished items are selected for comparisons with probability $p$ and the selected $M$ items are compared $L$ times with multinomial outcomes, we establish the statistical rates of convergence for underlying $n$ preference scores using both $\ell_2$-norm and $\ell_\infty$-norm, with the minimum sampling complexity. In addition, we establish the asymptotic normality of the maximum likelihood estimator that allows us to construct confidence intervals for the underlying scores. Furthermore, we propose a novel inference framework for ranking items through a sophisticated maximum pairwise difference statistic whose distribution is estimated via a valid Gaussian multiplier bootstrap. The estimated distribution is then used to construct simultaneous confidence intervals for the differences in the preference scores and the ranks of individual items. They also enable us to address various inference questions on the ranks of these items. Extensive simulation studies lend further support to our theoretical results. A real data application illustrates the usefulness of the proposed methods convincingly.  ( 2 min )
    Evaluating counterfactual explanations using Pearl's counterfactual method. (arXiv:2301.02499v1 [stat.ML])
    Counterfactual explanations (CEs) are methods for generating an alternative scenario that produces a different desirable outcome. For example, if a student is predicted to fail a course, then counterfactual explanations can provide the student with alternate ways so that they would be predicted to pass. The applications are many. However, CEs are currently generated from machine learning models that do not necessarily take into account the true causal structure in the data. By doing this, bias can be introduced into the CE quantities. I propose in this study to test the CEs using Judea Pearl's method of computing counterfactuals which has thus far, surprisingly, not been seen in the counterfactual explanation (CE) literature. I furthermore evaluate these CEs on three different causal structures to show how the true underlying causal structure affects the CEs that are generated. This study presented a method of evaluating CEs using Pearl's method and it showed, (although using a limited sample size), that thirty percent of the CEs conflicted with those computed by Pearl's method. This shows that we cannot simply trust CEs and it is vital for us to know the true causal structure before we blindly compute counterfactuals using the original machine learning model.  ( 2 min )
    Learning from a Biased Sample. (arXiv:2209.01754v2 [stat.ME] UPDATED)
    The empirical risk minimization approach to data-driven decision making assumes that we can learn a decision rule from training data drawn under the same conditions as the ones we want to deploy it in. However, in a number of settings, we may be concerned that our training sample is biased, and that some groups (characterized by either observable or unobservable attributes) may be under- or over-represented relative to the general population; and in this setting empirical risk minimization over the training set may fail to yield rules that perform well at deployment. We propose a model of sampling bias called $\Gamma$-biased sampling, where observed covariates can affect the probability of sample selection arbitrarily much but the amount of unexplained variation in the probability of sample selection is bounded by a constant factor. Applying the distributionally robust optimization framework, we propose a method for learning a decision rule that minimizes the worst-case risk incurred under a family of test distributions that can generate the training distribution under $\Gamma$-biased sampling. We apply a result of Rockafellar and Uryasev to show that this problem is equivalent to an augmented convex risk minimization problem. We give statistical guarantees for learning a model that is robust to sampling bias via the method of sieves, and propose a deep learning algorithm whose loss function captures our robust learning target. We empirically validate our proposed method in simulations and a case study on ICU length of stay prediction.  ( 2 min )
    Bringing Differential Private SGD to Practice: On the Independence of Gaussian Noise and the Number of Training Rounds. (arXiv:2102.09030v5 [cs.LG] UPDATED)
    Different from existing Differential Privacy (DP) accountants, we introduce pro-active DP. Existing DP accountants keep track of how privacy budget has been spent while pro-active DP is a scheme that allows one to {\it a-priori} select parameters of DP-SGD based on a fixed privacy budget (in terms of $\epsilon$ and $\delta$) in such a way to optimize the anticipated utility (test accuracy) the most. To implement this idea, we show how to convert the classical DP moment accountant to a pro-active DP by exploiting the fact that it has a simple close form for computing spent privacy budget for a given interaction round. The DP moment accountant is introduced in context of DP-SGD and has the following property which is the key ingredient to build pro-active DP. In DP-SGD each round communicates a local SGD update which leaks some new information about the underlying local data set to the outside world. In order to provide privacy, Gaussian noise with standard deviation $\sigma$ is added to local SGD updates after performing a clipping operation and normalizing with the clipping constant. We show that for attaining $(\epsilon,\delta)$-differential privacy $\sigma$ can be chosen equal to $\sqrt{2(\epsilon +\ln(1/\delta))/\epsilon}$ for $\epsilon=\Omega(T/N^2)$, where $T$ is the total number of rounds and $N$ is equal to the size of the local data set. In many existing machine learning problems, $N$ is always large and $T=O(N)$. Hence, $\sigma$ becomes ``independent'' of any $T=O(N)$ choice with $\epsilon=\Omega(1/N)$. This means that our {\em $\sigma$ only depends on $N$ rather than $T$}. We show how this differential privacy characterization allows us to convert DP moment accountant to a pro-active DP.  ( 3 min )
    Valid P-Value for Deep Learning-Driven Salient Region. (arXiv:2301.02437v1 [stat.ML])
    Various saliency map methods have been proposed to interpret and explain predictions of deep learning models. Saliency maps allow us to interpret which parts of the input signals have a strong influence on the prediction results. However, since a saliency map is obtained by complex computations in deep learning models, it is often difficult to know how reliable the saliency map itself is. In this study, we propose a method to quantify the reliability of a salient region in the form of p-values. Our idea is to consider a salient region as a selected hypothesis by the trained deep learning model and employ the selective inference framework. The proposed method can provably control the probability of false positive detections of salient regions. We demonstrate the validity of the proposed method through numerical examples in synthetic and real datasets. Furthermore, we develop a Keras-based framework for conducting the proposed selective inference for a wide class of CNNs without additional implementation cost.  ( 2 min )
    DANLIP: Deep Autoregressive Networks for Locally Interpretable Probabilistic Forecasting. (arXiv:2301.02332v1 [cs.LG])
    Despite the high performance of neural network-based time series forecasting methods, the inherent challenge in explaining their predictions has limited their applicability in certain application areas. Due to the difficulty in identifying causal relationships between the input and output of such black-box methods, they rarely have been adopted in domains such as legal and medical fields in which the reliability and interpretability of the results can be essential. In this paper, we propose \model, a novel deep learning-based probabilistic time series forecasting architecture that is intrinsically interpretable. We conduct experiments with multiple datasets and performance metrics and empirically show that our model is not only interpretable but also provides comparable performance to state-of-the-art probabilistic time series forecasting methods. Furthermore, we demonstrate that interpreting the parameters of the stochastic processes of interest can provide useful insights into several application areas.  ( 2 min )
    Multi-treatment Effect Estimation from Biomedical Data. (arXiv:2112.07574v3 [cs.LG] UPDATED)
    This work proposes the M3E2, a multi-task learning neural network model to estimate the effect of multiple treatments. In contrast to existing methods, M3E2 can handle multiple treatment effects applied simultaneously to the same unit, continuous and binary treatments, and many covariates. We compared M3E2 with three baselines in three synthetic benchmark datasets: two with multiple treatments and one with one treatment. Our analysis showed that our method has superior performance, making more assertive estimations of the multiple treatment effects.
    Seeking the Truth Beyond the Data. An Unsupervised Machine Learning Approach. (arXiv:2207.06949v3 [stat.ML] UPDATED)
    Clustering is an unsupervised machine learning methodology where unlabeled elements/objects are grouped together aiming to the construction of well-established clusters that their elements are classified according to their similarity. The goal of this process is to provide a useful aid to the researcher that will help her/him to identify patterns among the data. Dealing with large databases, such patterns may not be easily detectable without the contribution of a clustering algorithm. This article provides a deep description of the most widely used clustering methodologies accompanied by useful presentations concerning suitable parameter selection and initializations. Simultaneously, this article not only represents a review highlighting the major elements of examined clustering techniques but emphasizes the comparison of these algorithms' clustering efficiency based on 3 datasets, revealing their existing weaknesses and capabilities through accuracy and complexity, during the confrontation of discrete and continuous observations. The produced results help us extract valuable conclusions about the appropriateness of the examined clustering techniques in accordance with the dataset's size.
    Low-rank Approximation of Linear Maps. (arXiv:1812.09042v2 [stat.ML] UPDATED)
    This work provides closed-form solutions and minimum achievable errors for a large class of low-rank approximation problems in Hilbert spaces. The proposed theorem generalizes to the case of bounded linear operators the previous results obtained in the finite dimensional case for the Frobenius norm. The theorem provides the basis for the design of tractable algorithms for kernel or continuous DMD.  ( 2 min )
    A Robust Data-driven Process Modeling Applied to Time-series Stochastic Power Flow. (arXiv:2301.02651v1 [eess.SY])
    In this paper, we propose a robust data-driven process model whose hyperparameters are robustly estimated using the Schweppe-type generalized maximum likelihood estimator. The proposed model is trained on recorded time-series data of voltage phasors and power injections to perform a time-series stochastic power flow calculation. Power system data are often corrupted with outliers caused by large errors, fault conditions, power outages, and extreme weather, to name a few. The proposed model downweights vertical outliers and bad leverage points in the measurements of the training dataset. The weights used to bound the influence of the outliers are calculated using projection statistics, which are a robust version of Mahalanobis distances of the time series data points. The proposed method is demonstrated on the IEEE 33-Bus power distribution system and a real-world unbalanced 240-bus power distribution system heavily integrated with renewable energy sources. Our simulation results show that the proposed robust model can handle up to 25% of outliers in the training data set.
    Reversibility of elliptical slice sampling revisited. (arXiv:2301.02426v1 [math.ST])
    We discuss the well-definedness of elliptical slice sampling, a Markov chain approach for approximate sampling of posterior distributions introduced by Murray, Adams and MacKay 2010. We point to a regularity requirement and provide an alternative proof of the reversibility property. In particular, this guarantees the correctness of the slice sampling scheme also on infinite-dimensional separable Hilbert spaces.  ( 2 min )
    Lower Complexity Bounds of Finite-Sum Optimization Problems: The Results and Construction. (arXiv:2103.08280v5 [math.OC] UPDATED)
    In this paper, we study the lower complexity bounds for finite-sum optimization problems, where the objective is the average of $n$ individual component functions. We consider Proximal Incremental First-order (PIFO) algorithms which have access to the gradient and proximal oracles for each component function. To incorporate loopless methods, we also allow PIFO algorithms to obtain the full gradient infrequently. We develop a novel approach to constructing the hard instances, which partitions the tridiagonal matrix of classical examples into $n$ groups. This construction is friendly to the analysis of PIFO algorithms. Based on this construction, we establish the lower complexity bounds for finite-sum minimax optimization problems when the objective is convex-concave or nonconvex-strongly-concave and the class of component functions is $L$-average smooth. Most of these bounds are nearly matched by existing upper bounds up to log factors. We can also derive similar lower bounds for finite-sum minimization problems as previous work under both smoothness and average smoothness assumptions. Our lower bounds imply that proximal oracles for smooth functions are not much more powerful than gradient oracles.
    Convergence rates of the stochastic alternating algorithm for bi-objective optimization. (arXiv:2203.10605v2 [math.OC] UPDATED)
    Stochastic alternating algorithms for bi-objective optimization are considered when optimizing two conflicting functions for which optimization steps have to be applied separately for each function. Such algorithms consist of applying a certain number of steps of gradient or subgradient descent on each single objective at each iteration. In this paper, we show that stochastic alternating algorithms achieve a sublinear convergence rate of $\mathcal{O}(1/T)$, under strong convexity, for the determination of a minimizer of a weighted-sum of the two functions, parameterized by the number of steps applied on each of them. An extension to the convex case is presented for which the rate weakens to $\mathcal{O}(1/\sqrt{T})$. These rates are valid also in the non-smooth case. Importantly, by varying the proportion of steps applied to each function, one can determine an approximation to the Pareto front.
    Optimal Approximation Rates for Deep ReLU Neural Networks on Sobolev Spaces. (arXiv:2211.14400v2 [stat.ML] UPDATED)
    Let $\Omega = [0,1]^d$ be the unit cube in $\mathbb{R}^d$. We study the problem of how efficiently, in terms of the number of parameters, deep neural networks with the ReLU activation function can approximate functions in the Sobolev space $W^s(L_q(\Omega))$ with error measured in $L_p(\Omega)$. This problem is important when studying the application of neural networks in scientific computing and has previously been completely solved only in the case $p=q=\infty$. Our contribution is to provide a complete solution for all $1\leq p,q\leq \infty$ and $s > 0$, including asymptotically matching upper and lower bounds. The key technical tool is a novel bit-extraction technique which gives an optimal encoding of sparse vectors. This enables us to obtain sharp upper bounds in the non-linear regime where $p > q$. We also provide a novel method for deriving $L_p$-approximation lower bounds based upon VC-dimension when $p < \infty$. Our results show that very deep ReLU networks significantly outperform classical methods of approximation in terms of the number of parameters, but that this comes at the cost of parameters which are not encodable.
  • Open

    Jensen-Shannon Divergence Based Loss Functions for Bayesian Neural Networks. (arXiv:2209.11366v2 [cs.LG] UPDATED)
    The Kullback-Leibler (KL) divergence is widely used for the variational inference of Bayesian Neural Networks (BNNs) to approximate the posterior distribution of weights. However, the KL divergence is unbounded and asymmetric, which may lead to instabilities during optimization or may yield poor generalizations. To overcome these limitations, we examine the Jensen-Shannon (JS) divergence that is more general, bounded, and symmetric. Towards this, we propose two novel loss functions for BNNs: 1) a geometric JS divergence (JS-G) based loss function that is symmetric but unbounded with closed-form expression for Gaussian priors and 2) a generalized JS divergence (JS-A) based loss function that is symmetric and bounded. We show that the conventional KL divergence-based loss function is a special case of the loss functions presented in this work. To evaluate the divergence part of the proposed JS-G-based loss function, we use an exact closed-form expression for Gaussian priors. For any other priors of JS-G and for the JS-A-based loss function we use Monte Carlo approximation. We provide algorithms to optimize the loss function using both these methods. The proposed loss functions offer additional parameters that can be tuned to control the regularisation. We explain the reason why the proposed loss functions should perform better than the state-of-the-art. Further, we derive the conditions under which the proposed JS-G-loss function regularises better than the KL divergence-based loss function for Gaussian priors and posteriors. The proposed JS divergence-based Bayesian convolutional neural networks (BCNN) perform better than the state-of-the-art BCNN, which is shown for the classification of the CIFAR data set having various degrees of noise and a biased histopathology data set.
    Optimal Approximation Rates for Deep ReLU Neural Networks on Sobolev Spaces. (arXiv:2211.14400v2 [stat.ML] UPDATED)
    Let $\Omega = [0,1]^d$ be the unit cube in $\mathbb{R}^d$. We study the problem of how efficiently, in terms of the number of parameters, deep neural networks with the ReLU activation function can approximate functions in the Sobolev space $W^s(L_q(\Omega))$ with error measured in $L_p(\Omega)$. This problem is important when studying the application of neural networks in scientific computing and has previously been completely solved only in the case $p=q=\infty$. Our contribution is to provide a complete solution for all $1\leq p,q\leq \infty$ and $s > 0$, including asymptotically matching upper and lower bounds. The key technical tool is a novel bit-extraction technique which gives an optimal encoding of sparse vectors. This enables us to obtain sharp upper bounds in the non-linear regime where $p > q$. We also provide a novel method for deriving $L_p$-approximation lower bounds based upon VC-dimension when $p < \infty$. Our results show that very deep ReLU networks significantly outperform classical methods of approximation in terms of the number of parameters, but that this comes at the cost of parameters which are not encodable.
    Proportional Multicalibration. (arXiv:2209.14613v2 [cs.LG] UPDATED)
    Multicalibration is a desirable fairness criteria that constrains calibration error among flexibly-defined groups in the data while maintaining overall calibration. However, when outcome probabilities are correlated with group membership, multicalibrated models can exhibit a higher percent calibration error among groups with lower base rates than groups with higher base rates. As a result, it remains possible for a decision-maker to learn to trust or distrust model predictions for specific groups. To alleviate this, we propose \emph{proportional multicalibration}, a criteria that constrains the percent calibration error among groups and within prediction bins. We prove that satisfying proportional multicalibration bounds a model's multicalibration as well its \emph{differential calibration}, a stronger fairness criteria inspired by the fairness notion of sufficiency. We provide an efficient algorithm for post-processing risk prediction models for proportional multicalibration and evaluate it empirically. We conduct simulation studies and investigate a real-world application of PMC-postprocessing to prediction of emergency department patient admissions. We observe that proportional multicalibration is a promising criteria for controlling simultaneous measures of calibration fairness of a model over intersectional groups with virtually no cost in terms of classification performance.
    DABS: A Domain-Agnostic Benchmark for Self-Supervised Learning. (arXiv:2111.12062v2 [cs.LG] UPDATED)
    Self-supervised learning algorithms, including BERT and SimCLR, have enabled significant strides in fields like natural language processing, computer vision, and speech processing. However, these algorithms are domain-specific, meaning that new self-supervised learning algorithms must be developed for each new setting, including myriad healthcare, scientific, and multimodal domains. To catalyze progress toward domain-agnostic methods, we introduce DABS: a Domain-Agnostic Benchmark for Self-supervised learning. To perform well on DABS, an algorithm is evaluated on seven diverse domains: natural images, multichannel sensor data, English text, speech recordings, multilingual text, chest x-rays, and images with text descriptions. Each domain contains an unlabeled dataset for pretraining; the model is then is scored based on its downstream performance on a set of labeled tasks in the domain. We also present e-Mix and ShED: two baseline domain-agnostic algorithms; their relatively modest performance demonstrates that significant progress is needed before self-supervised learning is an out-of-the-box solution for arbitrary domains. Code for benchmark datasets and baseline algorithms is available at https://github.com/alextamkin/dabs.
    p-Adic Statistical Field Theory and Deep Belief Networks. (arXiv:2207.13877v4 [math-ph] UPDATED)
    In this work we initiate the study of the correspondence between p-adic statistical field theories (SFTs) and neural networks (NNs). In general quantum field theories over a p-adic spacetime can be formulated in a rigorous way. Nowadays these theories are considered just mathematical toy models for understanding the problems of the true theories. In this work we show these theories are deeply connected with the deep belief networks (DBNs). Hinton et al. constructed DBNs by stacking several restricted Boltzmann machines (RBMs). The purpose of this construction is to obtain a network with a hierarchical structure (a deep learning architecture). An RBM corresponds to a certain spin glass, we argue that a DBN should correspond to an ultrametric spin glass. A model of such a system can be easily constructed by using p-adic numbers. In our approach, a p-adic SFT corresponds to a p-adic continuous DBN, and a discretization of this theory corresponds to a p-adic discrete DBN. We show that these last machines are universal approximators. In the p-adic framework, the correspondence between SFTs and NNs is not fully developed. We point out several open problems.
    Inversion of sea surface currents from satellite-derived SST-SSH synergies with 4DVarNets. (arXiv:2211.13059v2 [physics.ao-ph] UPDATED)
    Satellite altimetry is a unique way for direct observations of sea surface dynamics. This is however limited to the surface-constrained geostrophic component of sea surface velocities. Ageostrophic dynamics are however expected to be significant for horizontal scales below 100~km and time scale below 10~days. The assimilation of ocean general circulation models likely reveals only a fraction of this ageostrophic component. Here, we explore a learning-based scheme to better exploit the synergies between the observed sea surface tracers, especially sea surface height (SSH) and sea surface temperature (SST), to better inform sea surface currents. More specifically, we develop a 4DVarNet scheme which exploits a variational data assimilation formulation with trainable observations and {\em a priori} terms. An Observing System Simulation Experiment (OSSE) in a region of the Gulf Stream suggests that SST-SSH synergies could reveal sea surface velocities for time scales of 2.5-3.0 days and horizontal scales of 0.5$^\circ$-0.7$^\circ$, including a significant fraction of the ageostrophic dynamics ($\approx$ 47\%). The analysis of the contribution of different observation data, namely nadir along-track altimetry, wide-swath SWOT altimetry and SST data, emphasizes the role of SST features for the reconstruction at horizontal spatial scales ranging from \nicefrac{1}{20}$^\circ$ to \nicefrac{1}{4}$^\circ$.
    TarViS: A Unified Approach for Target-based Video Segmentation. (arXiv:2301.02657v1 [cs.CV])
    The general domain of video segmentation is currently fragmented into different tasks spanning multiple benchmarks. Despite rapid progress in the state-of-the-art, current methods are overwhelmingly task-specific and cannot conceptually generalize to other tasks. Inspired by recent approaches with multi-task capability, we propose TarViS: a novel, unified network architecture that can be applied to any task that requires segmenting a set of arbitrarily defined 'targets' in video. Our approach is flexible with respect to how tasks define these targets, since it models the latter as abstract 'queries' which are then used to predict pixel-precise target masks. A single TarViS model can be trained jointly on a collection of datasets spanning different tasks, and can hot-swap between tasks during inference without any task-specific retraining. To demonstrate its effectiveness, we apply TarViS to four different tasks, namely Video Instance Segmentation (VIS), Video Panoptic Segmentation (VPS), Video Object Segmentation (VOS) and Point Exemplar-guided Tracking (PET). Our unified, jointly trained model achieves state-of-the-art performance on 5/7 benchmarks spanning these four tasks, and competitive performance on the remaining two.
    Seeking the Truth Beyond the Data. An Unsupervised Machine Learning Approach. (arXiv:2207.06949v3 [stat.ML] UPDATED)
    Clustering is an unsupervised machine learning methodology where unlabeled elements/objects are grouped together aiming to the construction of well-established clusters that their elements are classified according to their similarity. The goal of this process is to provide a useful aid to the researcher that will help her/him to identify patterns among the data. Dealing with large databases, such patterns may not be easily detectable without the contribution of a clustering algorithm. This article provides a deep description of the most widely used clustering methodologies accompanied by useful presentations concerning suitable parameter selection and initializations. Simultaneously, this article not only represents a review highlighting the major elements of examined clustering techniques but emphasizes the comparison of these algorithms' clustering efficiency based on 3 datasets, revealing their existing weaknesses and capabilities through accuracy and complexity, during the confrontation of discrete and continuous observations. The produced results help us extract valuable conclusions about the appropriateness of the examined clustering techniques in accordance with the dataset's size.
    When Spectral Modeling Meets Convolutional Networks: A Method for Discovering Reionization-era Lensed Quasars in Multi-band Imaging Data. (arXiv:2211.14543v2 [astro-ph.GA] UPDATED)
    Over the last two decades, around 300 quasars have been discovered at $z\gtrsim6$, yet only one has identified as being strongly gravitationally lensed. We explore a new approach -- enlarging the permitted spectral parameter space, while introducing a new spatial geometry veto criterion -- which is implemented via image-based deep learning. We first apply this approach to a systematic search for reionization-era lensed quasars, using data from the Dark Energy Survey, the Visible and Infrared Survey Telescope for Astronomy Hemisphere Survey, and the Wide-field Infrared Survey Explorer.Our search method consists of two main parts: (i) the preselection of the candidates based on their spectral energy distributions (SEDs) using catalog-level photometry and (ii) relative probabilities calculation of the candidates being a lens or some contaminant, utilizing a convolutional neural network (CNN) classification. The training data sets are constructed by painting deflected point-source lights over actual galaxy images, to generate realistic galaxy-quasar lens models, optimized to find systems with small image separations, i.e., Einstein radii of $\theta_\mathrm{E} \leq 1$ arcsec. Visual inspection is then performed for sources with CNN scores of $P_\mathrm{lens} > 0.1$, which leads us to obtain 36 newly selected lens candidates, which are awaiting spectroscopic confirmation. These findings show that automated SED modeling and deep learning pipelines, supported by modest human input, are a promising route for detecting strong lenses from large catalogs that can overcome the veto limitations of primarily dropout-based SED selection approaches.
    Utilising physics-guided deep learning to overvome data scarcity. (arXiv:2211.15664v2 [cs.LG] UPDATED)
    Deep learning (DL) relies heavily on data, and the quality of data influences its performance significantly. However, obtaining high-quality, well-annotated datasets can be challenging or even impossible in many real-world applications, such as structural risk estimation and medical diagnosis. This presents a significant barrier to the practical implementation of DL in these fields. Physics-guided deep learning (PGDL) is a novel type of DL that can integrate physics laws to train neural networks. This can be applied to any systems that are controlled or governed by physics laws, such as mechanics, finance and medical applications. It has been demonstrated that, with the additional information provided by physics laws, PGDL achieves great accuracy and generalisation in the presence of data scarcity. This review provides a detailed examination of PGDL and offers a structured overview of its use in addressing data scarcity across various fields, including physics, engineering and medical applications. Moreover, the review identifies the current limitations and opportunities for PGDL in relation to data scarcity and offers a thorough discussion on the future prospects of PGDL.
    Expanding boundaries of Gap Safe screening. (arXiv:2102.10846v2 [cs.LG] UPDATED)
    Sparse optimization problems are ubiquitous in many fields such as statistics, signal/image processing and machine learning. This has led to the birth of many iterative algorithms to solve them. A powerful strategy to boost the performance of these algorithms is known as safe screening: it allows the early identification of zero coordinates in the solution, which can then be eliminated to reduce the problem's size and accelerate convergence. In this work, we extend the existing Gap Safe screening framework by relaxing the global strong-concavity assumption on the dual cost function. Instead, we exploit local regularity properties, that is, strong concavity on well-chosen subsets of the domain. The non-negativity constraint is also integrated to the existing framework. Besides making safe screening possible to a broader class of functions that includes beta-divergences (e.g., the Kullback-Leibler divergence), the proposed approach also improves upon the existing Gap Safe screening rules on previously applicable cases (e.g., logistic regression). The proposed general framework is exemplified by some notable particular cases: logistic function, beta = 1.5 and Kullback-Leibler divergences. Finally, we showcase the effectiveness of the proposed screening rules with different solvers (coordinate descent, multiplicative-update and proximal gradient algorithms) and different data sets (binary classification, hyperspectral and count data).
    Two Wrongs Don't Make a Right: Combating Confirmation Bias in Learning with Label Noise. (arXiv:2112.02960v3 [cs.LG] UPDATED)
    Noisy labels damage the performance of deep networks. For robust learning, a prominent two-stage pipeline alternates between eliminating possible incorrect labels and semi-supervised training. However, discarding part of noisy labels could result in a loss of information, especially when the corruption has a dependency on data, e.g., class-dependent or instance-dependent. Moreover, from the training dynamics of a representative two-stage method DivideMix, we identify the domination of confirmation bias: pseudo-labels fail to correct a considerable amount of noisy labels, and consequently, the errors accumulate. To sufficiently exploit information from noisy labels and mitigate wrong corrections, we propose Robust Label Refurbishment (Robust LR) a new hybrid method that integrates pseudo-labeling and confidence estimation techniques to refurbish noisy labels. We show that our method successfully alleviates the damage of both label noise and confirmation bias. As a result, it achieves state-of-the-art performance across datasets and noise types, namely CIFAR under different levels of synthetic noise and Mini-WebVision and ANIMAL-10N with real-world noise.
    Fast and Low-Memory Deep Neural Networks Using Binary Matrix Factorization. (arXiv:2210.13468v2 [cs.LG] UPDATED)
    Despite the outstanding performance of deep neural networks in different applications, they are still computationally extensive and require a great number of memories. This motivates more research on reducing the resources required for implementing such networks. An efficient approach addressed for this purpose is matrix factorization, which has been shown to be effective on different networks. In this paper, we utilize binary matrix factorization and show its great efficiency in reducing the required number of resources in deep neural networks. In effect, this technique can lead to the practical implementation of such networks.
    First Go, then Post-Explore: the Benefits of Post-Exploration in Intrinsic Motivation. (arXiv:2212.03251v2 [cs.LG] UPDATED)
    Go-Explore achieved breakthrough performance on challenging reinforcement learning (RL) tasks with sparse rewards. The key insight of Go-Explore was that successful exploration requires an agent to first return to an interesting state ('Go'), and only then explore into unknown terrain ('Explore'). We refer to such exploration after a goal is reached as 'post-exploration'. In this paper, we present a clear ablation study of post-exploration in a general intrinsically motivated goal exploration process (IMGEP) framework, that the Go-Explore paper did not show. We study the isolated potential of post-exploration, by turning it on and off within the same algorithm under both tabular and deep RL settings on both discrete navigation and continuous control tasks. Experiments on a range of MiniGrid and Mujoco environments show that post-exploration indeed helps IMGEP agents reach more diverse states and boosts their performance. In short, our work suggests that RL researchers should consider to use post-exploration in IMGEP when possible since it is effective, method-agnostic and easy to implement.
    Centralized Cooperative Exploration Policy for Continuous Control Tasks. (arXiv:2301.02375v1 [cs.LG])
    The deep reinforcement learning (DRL) algorithm works brilliantly on solving various complex control tasks. This phenomenal success can be partly attributed to DRL encouraging intelligent agents to sufficiently explore the environment and collect diverse experiences during the agent training process. Therefore, exploration plays a significant role in accessing an optimal policy for DRL. Despite recent works making great progress in continuous control tasks, exploration in these tasks has remained insufficiently investigated. To explicitly encourage exploration in continuous control tasks, we propose CCEP (Centralized Cooperative Exploration Policy), which utilizes underestimation and overestimation of value functions to maintain the capacity of exploration. CCEP first keeps two value functions initialized with different parameters, and generates diverse policies with multiple exploration styles from a pair of value functions. In addition, a centralized policy framework ensures that CCEP achieves message delivery between multiple policies, furthermore contributing to exploring the environment cooperatively. Extensive experimental results demonstrate that CCEP achieves higher exploration capacity. Empirical analysis shows diverse exploration styles in the learned policies by CCEP, reaping benefits in more exploration regions. And this exploration capacity of CCEP ensures it outperforms the current state-of-the-art methods across multiple continuous control tasks shown in experiments.
    How Powerful are K-hop Message Passing Graph Neural Networks. (arXiv:2205.13328v4 [cs.LG] UPDATED)
    The most popular design paradigm for Graph Neural Networks (GNNs) is 1-hop message passing -- aggregating information from 1-hop neighbors repeatedly. However, the expressive power of 1-hop message passing is bounded by the Weisfeiler-Lehman (1-WL) test. Recently, researchers extended 1-hop message passing to K-hop message passing by aggregating information from K-hop neighbors of nodes simultaneously. However, there is no work on analyzing the expressive power of K-hop message passing. In this work, we theoretically characterize the expressive power of K-hop message passing. Specifically, we first formally differentiate two different kernels of K-hop message passing which are often misused in previous works. We then characterize the expressive power of K-hop message passing by showing that it is more powerful than 1-WL and can distinguish almost all regular graphs. Despite the higher expressive power, we show that K-hop message passing still cannot distinguish some simple regular graphs and its expressive power is bounded by 3-WL. To further enhance its expressive power, we introduce a KP-GNN framework, which improves K-hop message passing by leveraging the peripheral subgraph information in each hop. We show that KP-GNN can distinguish many distance regular graphs which could not be distinguished by previous distance encoding or 3-WL methods. Experimental results verify the expressive power and effectiveness of KP-GNN. KP-GNN achieves competitive results across all benchmark datasets.
    Multi-Agent Reinforcement Learning for Fast-Timescale Demand Response of Residential Loads. (arXiv:2301.02593v1 [cs.MA])
    To integrate high amounts of renewable energy resources, electrical power grids must be able to cope with high amplitude, fast timescale variations in power generation. Frequency regulation through demand response has the potential to coordinate temporally flexible loads, such as air conditioners, to counteract these variations. Existing approaches for discrete control with dynamic constraints struggle to provide satisfactory performance for fast timescale action selection with hundreds of agents. We propose a decentralized agent trained with multi-agent proximal policy optimization with localized communication. We explore two communication frameworks: hand-engineered, or learned through targeted multi-agent communication. The resulting policies perform well and robustly for frequency regulation, and scale seamlessly to arbitrary numbers of houses for constant processing times.
    Signal Enhancement for Magnetic Navigation Challenge Problem. (arXiv:2007.12158v2 [cs.LG] UPDATED)
    Harnessing the magnetic field of the Earth for navigation has shown promise as a viable alternative to other navigation systems. A magnetic navigation system collects its own magnetic field data using a magnetometer and uses magnetic anomaly maps to determine the current location. The greatest challenge with magnetic navigation arises when the magnetic field measurements from the magnetometer encompass the magnetic field from not just the Earth, but also from the vehicle on which it is mounted. It is difficult to separate the Earth magnetic anomaly field, which is crucial for navigation, from the total magnetic field reading from the sensor. The purpose of this challenge problem is to decouple the Earth and aircraft magnetic signals in order to derive a clean signal from which to perform magnetic navigation. Baseline testing on the dataset has shown that the Earth magnetic field can be extracted from the total magnetic field using machine learning (ML). The challenge is to remove the aircraft magnetic field from the total magnetic field using a trained model. This challenge offers an opportunity to construct an effective model for removing the aircraft magnetic field from the dataset by using a scientific machine learning (SciML) approach comprised of an ML algorithm integrated with the physics of magnetic navigation.
    Understanding Urban Water Consumption using Remotely Sensed Data. (arXiv:2205.02932v2 [cs.CV] UPDATED)
    Urban metabolism is an active field of research that deals with the estimation of emissions and resource consumption from urban regions. The analysis could be carried out through a manual surveyor by the implementation of elegant machine learning algorithms. In this exploratory work, we estimate the water consumption by the buildings in the region captured by satellite imagery. To this end, we break our analysis into three parts: i) Identification of building pixels, given a satellite image, followed by ii) identification of the building type (residential/non-residential) from the building pixels, and finally iii) using the building pixels along with their type to estimate the water consumption using the average per unit area consumption for different building types as obtained from municipal surveys.
    Learning Invariant Rules from Data for Interpretable Anomaly Detection. (arXiv:2211.13577v2 [cs.LG] UPDATED)
    In the research area of anomaly detection, novel and promising methods are frequently developed. However, most existing studies exclusively focus on the detection task only and ignore the interpretability of the underlying models as well as their detection results. However, anomaly interpretation, which aims to provide explanation of why specific data instances are identified as anomalies, is an equally important task in many real-world applications. In this work, we propose a novel framework which synergizes several machine learning and data mining techniques to automatically learn invariant rules that are consistently satisfied in the training data. The learned invariant rules can provide explicit explanation of anomaly detection results and thus are extremely useful for subsequent decision-making regarding reported anomalies. Furthermore, our empirical evaluation shows that the proposed method can also achieve comparable or even better performance in terms of AUC and partial AUC on public benchmark datasets across various application domains compared with start-of-the-art anomaly detection models.
    Silent Killer: Optimizing Backdoor Trigger Yields a Stealthy and Powerful Data Poisoning Attack. (arXiv:2301.02615v1 [cs.CR])
    We propose a stealthy and powerful backdoor attack on neural networks based on data poisoning (DP). In contrast to previous attacks, both the poison and the trigger in our method are stealthy. We are able to change the model's classification of samples from a source class to a target class chosen by the attacker. We do so by using a small number of poisoned training samples with nearly imperceptible perturbations, without changing their labels. At inference time, we use a stealthy perturbation added to the attacked samples as a trigger. This perturbation is crafted as a universal adversarial perturbation (UAP), and the poison is crafted using gradient alignment coupled to this trigger. Our method is highly efficient in crafting time compared to previous methods and requires only a trained surrogate model without additional retraining. Our attack achieves state-of-the-art results in terms of attack success rate while maintaining high accuracy on clean samples.
    Superficial White Matter Analysis: An Efficient Point-cloud-based Deep Learning Framework with Supervised Contrastive Learning for Consistent Tractography Parcellation across Populations and dMRI Acquisitions. (arXiv:2207.08975v2 [eess.IV] UPDATED)
    Diffusion MRI tractography is an advanced imaging technique that enables in vivo mapping of the brain's white matter connections. White matter parcellation classifies tractography streamlines into clusters or anatomically meaningful tracts. It enables quantification and visualization of whole-brain tractography. Currently, most parcellation methods focus on the deep white matter (DWM), whereas fewer methods address the superficial white matter (SWM) due to its complexity. We propose a novel two-stage deep-learning-based framework, Superficial White Matter Analysis (SupWMA), that performs an efficient and consistent parcellation of 198 SWM clusters from whole-brain tractography. A point-cloud-based network is adapted to our SWM parcellation task, and supervised contrastive learning enables more discriminative representations between plausible streamlines and outliers for SWM. We train our model on a large-scale tractography dataset including streamline samples from labeled long- and medium-range (over 40mm) SWM clusters and anatomically implausible streamline samples, and we perform testing on six independently acquired datasets of different ages and health conditions (including neonates and patients with space-occupying brain tumors). Compared to several state-of-the-art methods, SupWMA obtains highly consistent and accurate SWM parcellation results on all datasets, showing good generalization across the lifespan in health and disease. In addition, the computational speed of SupWMA is much faster than other methods.
    Triple-stream Deep Metric Learning of Great Ape Behavioural Actions. (arXiv:2301.02642v1 [cs.CV])
    We propose the first metric learning system for the recognition of great ape behavioural actions. Our proposed triple stream embedding architecture works on camera trap videos taken directly in the wild and demonstrates that the utilisation of an explicit DensePose-C chimpanzee body part segmentation stream effectively complements traditional RGB appearance and optical flow streams. We evaluate system variants with different feature fusion techniques and long-tail recognition approaches. Results and ablations show performance improvements of ~12% in top-1 accuracy over previous results achieved on the PanAf-500 dataset containing 180,000 manually annotated frames across nine behavioural actions. Furthermore, we provide a qualitative analysis of our findings and augment the metric learning system with long-tail recognition techniques showing that average per class accuracy -- critical in the domain -- can be improved by ~23% compared to the literature on that dataset. Finally, since our embedding spaces are constructed as metric, we provide first data-driven visualisations of the great ape behavioural action spaces revealing emerging geometry and topology. We hope that the work sparks further interest in this vital application area of computer vision for the benefit of endangered great apes.
    Learning from a Biased Sample. (arXiv:2209.01754v2 [stat.ME] UPDATED)
    The empirical risk minimization approach to data-driven decision making assumes that we can learn a decision rule from training data drawn under the same conditions as the ones we want to deploy it in. However, in a number of settings, we may be concerned that our training sample is biased, and that some groups (characterized by either observable or unobservable attributes) may be under- or over-represented relative to the general population; and in this setting empirical risk minimization over the training set may fail to yield rules that perform well at deployment. We propose a model of sampling bias called $\Gamma$-biased sampling, where observed covariates can affect the probability of sample selection arbitrarily much but the amount of unexplained variation in the probability of sample selection is bounded by a constant factor. Applying the distributionally robust optimization framework, we propose a method for learning a decision rule that minimizes the worst-case risk incurred under a family of test distributions that can generate the training distribution under $\Gamma$-biased sampling. We apply a result of Rockafellar and Uryasev to show that this problem is equivalent to an augmented convex risk minimization problem. We give statistical guarantees for learning a model that is robust to sampling bias via the method of sieves, and propose a deep learning algorithm whose loss function captures our robust learning target. We empirically validate our proposed method in simulations and a case study on ICU length of stay prediction.
    Does compressing activations help model parallel training?. (arXiv:2301.02654v1 [cs.LG])
    Large-scale Transformer models are known for their exceptional performance in a range of tasks, but training them can be difficult due to the requirement for communication-intensive model parallelism. One way to improve training speed is to compress the message size in communication. Previous approaches have primarily focused on compressing gradients in a data parallelism setting, but compression in a model-parallel setting is an understudied area. We have discovered that model parallelism has fundamentally different characteristics than data parallelism. In this work, we present the first empirical study on the effectiveness of compression methods for model parallelism. We implement and evaluate three common classes of compression algorithms - pruning-based, learning-based, and quantization-based - using a popular Transformer training framework. We evaluate these methods across more than 160 settings and 8 popular datasets, taking into account different hyperparameters, hardware, and both fine-tuning and pre-training stages. We also provide analysis when the model is scaled up. Finally, we provide insights for future development of model parallelism compression algorithms.
    SEQUENT: Towards Traceable Quantum Machine Learning using Sequential Quantum Enhanced Training. (arXiv:2301.02601v1 [quant-ph])
    Applying new computing paradigms like quantum computing to the field of machine learning has recently gained attention. However, as high-dimensional real-world applications are not yet feasible to be solved using purely quantum hardware, hybrid methods using both classical and quantum machine learning paradigms have been proposed. For instance, transfer learning methods have been shown to be successfully applicable to hybrid image classification tasks. Nevertheless, beneficial circuit architectures still need to be explored. Therefore, tracing the impact of the chosen circuit architecture and parameterization is crucial for the development of beneficially applicable hybrid methods. However, current methods include processes where both parts are trained concurrently, therefore not allowing for a strict separability of classical and quantum impact. Thus, those architectures might produce models that yield a superior prediction accuracy whilst employing the least possible quantum impact. To tackle this issue, we propose Sequential Quantum Enhanced Training (SEQUENT) an improved architecture and training process for the traceable application of quantum computing methods to hybrid machine learning. Furthermore, we provide formal evidence for the disadvantage of current methods and preliminary experimental results as a proof-of-concept for the applicability of SEQUENT.
    Bringing Differential Private SGD to Practice: On the Independence of Gaussian Noise and the Number of Training Rounds. (arXiv:2102.09030v5 [cs.LG] UPDATED)
    Different from existing Differential Privacy (DP) accountants, we introduce pro-active DP. Existing DP accountants keep track of how privacy budget has been spent while pro-active DP is a scheme that allows one to {\it a-priori} select parameters of DP-SGD based on a fixed privacy budget (in terms of $\epsilon$ and $\delta$) in such a way to optimize the anticipated utility (test accuracy) the most. To implement this idea, we show how to convert the classical DP moment accountant to a pro-active DP by exploiting the fact that it has a simple close form for computing spent privacy budget for a given interaction round. The DP moment accountant is introduced in context of DP-SGD and has the following property which is the key ingredient to build pro-active DP. In DP-SGD each round communicates a local SGD update which leaks some new information about the underlying local data set to the outside world. In order to provide privacy, Gaussian noise with standard deviation $\sigma$ is added to local SGD updates after performing a clipping operation and normalizing with the clipping constant. We show that for attaining $(\epsilon,\delta)$-differential privacy $\sigma$ can be chosen equal to $\sqrt{2(\epsilon +\ln(1/\delta))/\epsilon}$ for $\epsilon=\Omega(T/N^2)$, where $T$ is the total number of rounds and $N$ is equal to the size of the local data set. In many existing machine learning problems, $N$ is always large and $T=O(N)$. Hence, $\sigma$ becomes ``independent'' of any $T=O(N)$ choice with $\epsilon=\Omega(1/N)$. This means that our {\em $\sigma$ only depends on $N$ rather than $T$}. We show how this differential privacy characterization allows us to convert DP moment accountant to a pro-active DP.
    Provable Reset-free Reinforcement Learning by No-Regret Reduction. (arXiv:2301.02389v1 [cs.LG])
    Real-world reinforcement learning (RL) is often severely limited since typical RL algorithms heavily rely on the reset mechanism to sample proper initial states. In practice, the reset mechanism is expensive to implement due to the need for human intervention or heavily engineered environments. To make learning more practical, we propose a generic no-regret reduction to systematically design reset-free RL algorithms. Our reduction turns reset-free RL into a two-player game. We show that achieving sublinear regret in this two player game would imply learning a policy that has both sublinear performance regret and sublinear total number of resets in the original RL problem. This means that the agent eventually learns to perform optimally and avoid resets. By this reduction, we design an instantiation for linear Markov decision processes, which is the first provably correct reset-free RL algorithm to our knowledge.
    Text-Based Automatic Personality Prediction Using KGrAt-Net; A Knowledge Graph Attention Network Classifier. (arXiv:2205.13780v2 [cs.CL] UPDATED)
    Nowadays, a tremendous amount of human communications occur on Internet-based communication infrastructures, like social networks, email, forums, organizational communication platforms, etc. Indeed, the automatic prediction or assessment of individuals' personalities through their written or exchanged text would be advantageous to ameliorate their relationships. To this end, this paper aims to propose KGrAt-Net, which is a Knowledge Graph Attention Network text classifier. For the first time, it applies the knowledge graph attention network to perform Automatic Personality Prediction (APP), according to the Big Five personality traits. After performing some preprocessing activities, it first tries to acquire a knowing-full representation of the knowledge behind the concepts in the input text by building its equivalent knowledge graph. A knowledge graph collects interlinked descriptions of concepts, entities, and relationships in a machine-readable form. Practically, it provides a machine-readable cognitive understanding of concepts and semantic relationships among them. Then, applying the attention mechanism, it attempts to pay attention to the most relevant parts of the graph to predict the personality traits of the input text. We used 2,467 essays from the Essays Dataset. The results demonstrated that KGrAt-Net considerably improved personality prediction accuracies (up to 70.26% on average). Furthermore, KGrAt-Net also uses knowledge graph embedding to enrich the classification, which makes it even more accurate (on average, 72.41%) in APP.
    Approximate Real Symmetric Tensor Rank. (arXiv:2207.12529v3 [math.NA] UPDATED)
    We investigate the effect of an $\varepsilon$-room of perturbation tolerance on symmetric tensor decomposition. To be more precise, suppose a real symmetric $d$-tensor $f$, a norm $||.||$ on the space of symmetric $d$-tensors, and $\varepsilon >0$ are given. What is the smallest symmetric tensor rank in the $\varepsilon$-neighborhood of $f$? In other words, what is the symmetric tensor rank of $f$ after a clever $\varepsilon$-perturbation? We prove two theorems and develop three corresponding algorithms that give constructive upper bounds for this question. With expository goals in mind; we present probabilistic and convex geometric ideas behind our results, reproduce some known results, and point out open problems.
    "No, to the Right" -- Online Language Corrections for Robotic Manipulation via Shared Autonomy. (arXiv:2301.02555v1 [cs.RO])
    Systems for language-guided human-robot interaction must satisfy two key desiderata for broad adoption: adaptivity and learning efficiency. Unfortunately, existing instruction-following agents cannot adapt, lacking the ability to incorporate online natural language supervision, and even if they could, require hundreds of demonstrations to learn even simple policies. In this work, we address these problems by presenting Language-Informed Latent Actions with Corrections (LILAC), a framework for incorporating and adapting to natural language corrections - "to the right," or "no, towards the book" - online, during execution. We explore rich manipulation domains within a shared autonomy paradigm. Instead of discrete turn-taking between a human and robot, LILAC splits agency between the human and robot: language is an input to a learned model that produces a meaningful, low-dimensional control space that the human can use to guide the robot. Each real-time correction refines the human's control space, enabling precise, extended behaviors - with the added benefit of requiring only a handful of demonstrations to learn. We evaluate our approach via a user study where users work with a Franka Emika Panda manipulator to complete complex manipulation tasks. Compared to existing learned baselines covering both open-loop instruction following and single-turn shared autonomy, we show that our corrections-aware approach obtains higher task completion rates, and is subjectively preferred by users because of its reliability, precision, and ease of use.
    Multi-treatment Effect Estimation from Biomedical Data. (arXiv:2112.07574v3 [cs.LG] UPDATED)
    This work proposes the M3E2, a multi-task learning neural network model to estimate the effect of multiple treatments. In contrast to existing methods, M3E2 can handle multiple treatment effects applied simultaneously to the same unit, continuous and binary treatments, and many covariates. We compared M3E2 with three baselines in three synthetic benchmark datasets: two with multiple treatments and one with one treatment. Our analysis showed that our method has superior performance, making more assertive estimations of the multiple treatment effects.
    Neural Sheaf Diffusion: A Topological Perspective on Heterophily and Oversmoothing in GNNs. (arXiv:2202.04579v4 [cs.LG] UPDATED)
    Cellular sheaves equip graphs with a "geometrical" structure by assigning vector spaces and linear maps to nodes and edges. Graph Neural Networks (GNNs) implicitly assume a graph with a trivial underlying sheaf. This choice is reflected in the structure of the graph Laplacian operator, the properties of the associated diffusion equation, and the characteristics of the convolutional models that discretise this equation. In this paper, we use cellular sheaf theory to show that the underlying geometry of the graph is deeply linked with the performance of GNNs in heterophilic settings and their oversmoothing behaviour. By considering a hierarchy of increasingly general sheaves, we study how the ability of the sheaf diffusion process to achieve linear separation of the classes in the infinite time limit expands. At the same time, we prove that when the sheaf is non-trivial, discretised parametric diffusion processes have greater control than GNNs over their asymptotic behaviour. On the practical side, we study how sheaves can be learned from data. The resulting sheaf diffusion models have many desirable properties that address the limitations of classical graph diffusion equations (and corresponding GNN models) and obtain competitive results in heterophilic settings. Overall, our work provides new connections between GNNs and algebraic topology and would be of interest to both fields.
    Low-rank Approximation of Linear Maps. (arXiv:1812.09042v2 [stat.ML] UPDATED)
    This work provides closed-form solutions and minimum achievable errors for a large class of low-rank approximation problems in Hilbert spaces. The proposed theorem generalizes to the case of bounded linear operators the previous results obtained in the finite dimensional case for the Frobenius norm. The theorem provides the basis for the design of tractable algorithms for kernel or continuous DMD.
    Interpretable Disease Prediction based on Reinforcement Path Reasoning over Knowledge Graphs. (arXiv:2010.08300v2 [cs.LG] UPDATED)
    Objective: To combine medical knowledge and medical data to interpretably predict the risk of disease. Methods: We formulated the disease prediction task as a random walk along a knowledge graph (KG). Specifically, we build a KG to record relationships between diseases and risk factors according to validated medical knowledge. Then, a mathematical object walks along the KG. It starts walking at a patient entity, which connects the KG based on the patient current diseases or risk factors and stops at a disease entity, which represents the predicted disease. The trajectory generated by the object represents an interpretable disease progression path of the given patient. The dynamics of the object are controlled by a policy-based reinforcement learning (RL) module, which is trained by electronic health records (EHRs). Experiments: We utilized two real-world EHR datasets to evaluate the performance of our model. In the disease prediction task, our model achieves 0.743 and 0.639 in terms of macro area under the curve (AUC) in predicting 53 circulation system diseases in the two datasets, respectively. This performance is comparable to the commonly used machine learning (ML) models in medical research. In qualitative analysis, our clinical collaborator reviewed the disease progression paths generated by our model and advocated their interpretability and reliability. Conclusion: Experimental results validate the proposed model in interpretably evaluating and optimizing disease prediction. Significance: Our work contributes to leveraging the potential of medical knowledge and medical data jointly for interpretable prediction tasks.
    Multifidelity Modeling for Physics-Informed Neural Networks (PINNs). (arXiv:2106.13361v2 [physics.comp-ph] UPDATED)
    Multifidelity simulation methodologies are often used in an attempt to judiciously combine low-fidelity and high-fidelity simulation results in an accuracy-increasing, cost-saving way. Candidates for this approach are simulation methodologies for which there are fidelity differences connected with significant computational cost differences. Physics-informed Neural Networks (PINNs) are candidates for these types of approaches due to the significant difference in training times required when different fidelities (expressed in terms of architecture width and depth as well as optimization criteria) are employed. In this paper, we propose a particular multifidelity approach applied to PINNs that exploits low-rank structure. We demonstrate that width, depth, and optimization criteria can be used as parameters related to model fidelity, and show numerical justification of cost differences in training due to fidelity parameter choices. We test our multifidelity scheme on various canonical forward PDE models that have been presented in the emerging PINNs literature.
    Incremental Without Replacement Sampling in Nonconvex Optimization. (arXiv:2007.07557v4 [cs.LG] UPDATED)
    Minibatch decomposition methods for empirical risk minimization are commonly analysed in a stochastic approximation setting, also known as sampling with replacement. On the other hands modern implementations of such techniques are incremental: they rely on sampling without replacement, for which available analysis are much scarcer. We provide convergence guaranties for the latter variant by analysing a versatile incremental gradient scheme. For this scheme, we consider constant, decreasing or adaptive step sizes. In the smooth setting we obtain explicit complexity estimates in terms of epoch counter. In the nonsmooth setting we prove that the sequence is attracted by solutions of optimality conditions of the problem.
    Neuro-DynaStress: Predicting Dynamic Stress Distributions in Structural Components. (arXiv:2301.02580v1 [physics.geo-ph])
    Structural components are typically exposed to dynamic loading, such as earthquakes, wind, and explosions. Structural engineers should be able to conduct real-time analysis in the aftermath or during extreme disaster events requiring immediate corrections to avoid fatal failures. As a result, it is crucial to predict dynamic stress distributions during highly disruptive events in real-time. Currently available high-fidelity methods, such as Finite Element Models (FEMs), suffer from their inherent high complexity and are computationally prohibitive. Therefore, to reduce computational cost while preserving accuracy, a deep learning model, Neuro-DynaStress, is proposed to predict the entire sequence of stress distribution based on finite element simulations using a partial differential equation (PDE) solver. The model was designed and trained to use the geometry, boundary conditions and sequence of loads as input and predict the sequences of high-resolution stress contours. The performance of the proposed framework is compared to finite element simulations using a PDE solver.
    Architect, Regularize and Replay (ARR): a Flexible Hybrid Approach for Continual Learning. (arXiv:2301.02464v1 [cs.LG])
    In recent years we have witnessed a renewed interest in machine learning methodologies, especially for deep representation learning, that could overcome basic i.i.d. assumptions and tackle non-stationary environments subject to various distributional shifts or sample selection biases. Within this context, several computational approaches based on architectural priors, regularizers and replay policies have been proposed with different degrees of success depending on the specific scenario in which they were developed and assessed. However, designing comprehensive hybrid solutions that can flexibly and generally be applied with tunable efficiency-effectiveness trade-offs still seems a distant goal. In this paper, we propose "Architect, Regularize and Replay" (ARR), an hybrid generalization of the renowned AR1 algorithm and its variants, that can achieve state-of-the-art results in classic scenarios (e.g. class-incremental learning) but also generalize to arbitrary data streams generated from real-world datasets such as CIFAR-100, CORe50 and ImageNet-1000.
    Quantum reinforcement learning in continuous action space. (arXiv:2012.10711v3 [quant-ph] UPDATED)
    Quantum reinforcement learning (QRL) is one promising algorithm proposed for near-term quantum devices. Early QRL proposals are effective at solving problems in discrete action space, but often suffer from the curse of dimensionality in the continuous domain due to discretization. To address this problem, we propose a quantum Deep Deterministic Policy Gradient algorithm that is efficient at solving both classical and quantum sequential decision problems in the continuous domain. As an application, our method can solve the quantum state-generation problem in a single shot: it only requires a one-shot optimization to generate a model that outputs the desired control sequence for arbitrary target state. In comparison, the standard quantum control method requires optimizing for each target state. Moreover, our method can also be used to physically reconstruct an unknown quantum state.
    Covid19 Reproduction Number: Credibility Intervals by Blockwise Proximal Monte Carlo Samplers. (arXiv:2203.09142v2 [cs.LG] UPDATED)
    Monitoring the Covid19 pandemic constitutes a critical societal stake that received considerable research efforts. The intensity of the pandemic on a given territory is efficiently measured by the reproduction number, quantifying the rate of growth of daily new infections. Recently, estimates for the time evolution of the reproduction number were produced using an inverse problem formulation with a nonsmooth functional minimization. While it was designed to be robust to the limited quality of the Covid19 data (outliers, missing counts), the procedure lacks the ability to output credibility interval based estimates. This remains a severe limitation for practical use in actual pandemic monitoring by epidemiologists that the present work aims to overcome by use of Monte Carlo sampling. After interpretation of the nonsmooth functional into a Bayesian framework, several sampling schemes are tailored to adjust the nonsmooth nature of the resulting posterior distribution. The originality of the devised algorithms stems from combining a Langevin Monte Carlo sampling scheme with Proximal operators. Performance of the new algorithms in producing relevant credibility intervals for the reproduction number estimates and denoised counts are compared. Assessment is conducted on real daily new infection counts made available by the Johns Hopkins University. The interest of the devised monitoring tools are illustrated on Covid19 data from several different countries.
    IMKGA-SM: Interpretable Multimodal Knowledge Graph Answer Prediction via Sequence Modeling. (arXiv:2301.02445v1 [cs.AI])
    Multimodal knowledge graph link prediction aims to improve the accuracy and efficiency of link prediction tasks for multimodal data. However, for complex multimodal information and sparse training data, it is usually difficult to achieve interpretability and high accuracy simultaneously for most methods. To address this difficulty, a new model is developed in this paper, namely Interpretable Multimodal Knowledge Graph Answer Prediction via Sequence Modeling (IMKGA-SM). First, a multi-modal fine-grained fusion method is proposed, and Vgg16 and Optical Character Recognition (OCR) techniques are adopted to effectively extract text information from images and images. Then, the knowledge graph link prediction task is modelled as an offline reinforcement learning Markov decision model, which is then abstracted into a unified sequence framework. An interactive perception-based reward expectation mechanism and a special causal masking mechanism are designed, which ``converts" the query into an inference path. Then, an autoregressive dynamic gradient adjustment mechanism is proposed to alleviate the insufficient problem of multimodal optimization. Finally, two datasets are adopted for experiments, and the popular SOTA baselines are used for comparison. The results show that the developed IMKGA-SM achieves much better performance than SOTA baselines on multimodal link prediction datasets of different sizes.
    Topics as Entity Clusters: Entity-based Topics from Language Models and Graph Neural Networks. (arXiv:2301.02458v1 [cs.CL])
    Topic models aim to reveal the latent structure behind a corpus, typically conducted over a bag-of-words representation of documents. In the context of topic modeling, most vocabulary is either irrelevant for uncovering underlying topics or contains strong relationships with relevant concepts, impacting the interpretability of these topics. Furthermore, their limited expressiveness and dependency on language demand considerable computation resources. Hence, we propose a novel approach for cluster-based topic modeling that employs conceptual entities. Entities are language-agnostic representations of real-world concepts rich in relational information. To this end, we extract vector representations of entities from (i) an encyclopedic corpus using a language model; and (ii) a knowledge base using a graph neural network. We demonstrate that our approach consistently outperforms other state-of-the-art topic models across coherency metrics and find that the explicit knowledge encoded in the graph-based embeddings provides more coherent topics than the implicit knowledge encoded with the contextualized embeddings of language models.
    Deep leakage from gradients. (arXiv:2301.02621v1 [cs.CR])
    With the development of artificial intelligence technology, Federated Learning (FL) model has been widely used in many industries for its high efficiency and confidentiality. Some researchers have explored its confidentiality and designed some algorithms to attack training data sets, but these algorithms all have their own limitations. Therefore, most people still believe that local machine learning gradient information is safe and reliable. In this paper, an algorithm based on gradient features is designed to attack the federated learning model in order to attract more attention to the security of federated learning systems. In federated learning system, gradient contains little information compared with the original training data set, but this project intends to restore the original training image data through gradient information. Convolutional Neural Network (CNN) has excellent performance in image processing. Therefore, the federated learning model of this project is equipped with Convolutional Neural Network structure, and the model is trained by using image data sets. The algorithm calculates the virtual gradient by generating virtual image labels. Then the virtual gradient is matched with the real gradient to restore the original image. This attack algorithm is written in Python language, uses cat and dog classification Kaggle data sets, and gradually extends from the full connection layer to the convolution layer, thus improving the universality. At present, the average squared error between the data recovered by this algorithm and the original image information is approximately 5, and the vast majority of images can be completely restored according to the gradient information given, indicating that the gradient of federated learning system is not absolutely safe and reliable.
    BELLATREX: Building Explanations through a LocaLly AccuraTe Rule EXtractor. (arXiv:2203.15511v2 [cs.LG] UPDATED)
    Tree-ensemble algorithms, such as random forest, are effective machine learning methods popular for their flexibility, high performance, and robustness to overfitting. However, since multiple learners are combined, they are not as interpretable as a single decision tree. In this work we propose a novel method that is Building Explanations through a LocalLy AccuraTe Rule EXtractor (Bellatrex), and is able to explain the forest prediction for a given test instance with only a few diverse rules. Starting from the decision trees generated by a random forest, our method 1) pre-selects a subset of the rules used to make the prediction, 2) creates a vector representation of such rules, 3) projects them to a low-dimensional space, 4) clusters such representations to pick a rule from each cluster to explain the instance prediction. We test the effectiveness of Bellatrex on 89 real-world datasets and we demonstrate the validity of our method for binary classification, regression, multi-label classification and time-to-event tasks. To the best of our knowledge, it is the first time that an interpretability toolbox can handle all these tasks within the same framework. We also show that our extracted surrogate model can approximate the performance of the corresponding ensemble model in all considered tasks, while selecting only few trees from the whole forest. We also show that our proposed approach substantially outperforms other explainable methods in terms of predictive performance.
    Task Aware Feature Extraction Framework for Sequential Dependence Multi-Task Learning. (arXiv:2301.02494v1 [cs.LG])
    Multi-task learning (MTL) has been successfully implemented in many real-world applications, which aims to simultaneously solve multiple tasks with a single model. The general idea of multi-task learning is designing kinds of global parameter sharing mechanism and task-specific feature extractor to improve the performance of all tasks. However, sequential dependence between tasks are rarely studied but frequently encountered in e-commence online recommendation, e.g. impression, click and conversion on displayed product. There is few theoretical work on this problem and biased optimization object adopted in most MTL methods deteriorates online performance. Besides, challenge still remains in balancing the trade-off between various tasks and effectively learn common and specific representation. In this paper, we first analyze sequential dependence MTL from rigorous mathematical perspective and design a dependence task learning loss to provide an unbiased optimizing object. And we propose a Task Aware Feature Extraction (TAFE) framework for sequential dependence MTL, which enables to selectively reconstruct implicit shared representations from a sample-wise view and extract explicit task-specific information in an more efficient way. Extensive experiments on offline datasets and online A/B implementation demonstrate the effectiveness of our proposed TAFE.
    Machine Fault Classification using Hamiltonian Neural Networks. (arXiv:2301.02243v1 [cs.LG])
    A new approach is introduced to classify faults in rotating machinery based on the total energy signature estimated from sensor measurements. The overall goal is to go beyond using black-box models and incorporate additional physical constraints that govern the behavior of mechanical systems. Observational data is used to train Hamiltonian neural networks that describe the conserved energy of the system for normal and various abnormal regimes. The estimated total energy function, in the form of the weights of the Hamiltonian neural network, serves as the new feature vector to discriminate between the faults using off-the-shelf classification models. The experimental results are obtained using the MaFaulDa database, where the proposed model yields a promising area under the curve (AUC) of $0.78$ for the binary classification (normal vs abnormal) and $0.84$ for the multi-class problem (normal, and $5$ different abnormal regimes).
    Learning Personalized Brain Functional Connectivity of MDD Patients from Multiple Sites via Federated Bayesian Networks. (arXiv:2301.02423v1 [cs.LG])
    Identifying functional connectivity biomarkers of major depressive disorder (MDD) patients is essential to advance understanding of the disorder mechanisms and early intervention. However, due to the small sample size and the high dimension of available neuroimaging data, the performance of existing methods is often limited. Multi-site data could enhance the statistical power and sample size, while they are often subject to inter-site heterogeneity and data-sharing policies. In this paper, we propose a federated joint estimator, NOTEARS-PFL, for simultaneous learning of multiple Bayesian networks (BNs) with continuous optimization, to identify disease-induced alterations in MDD patients. We incorporate information shared between sites and site-specific information into the proposed federated learning framework to learn personalized BN structures by introducing the group fused lasso penalty. We develop the alternating direction method of multipliers, where in the local update step, the neuroimaging data is processed at each local site. Then the learned network structures are transmitted to the center for the global update. In particular, we derive a closed-form expression for the local update step and use the iterative proximal projection method to deal with the group fused lasso penalty in the global update step. We evaluate the performance of the proposed method on both synthetic and real-world multi-site rs-fMRI datasets. The results suggest that the proposed NOTEARS-PFL yields superior effectiveness and accuracy than the comparable methods.
    Deep learning for full-field ultrasonic characterization. (arXiv:2301.02378v1 [math.NA])
    This study takes advantage of recent advances in machine learning to establish a physics-based data analytic platform for distributed reconstruction of mechanical properties in layered components from full waveform data. In this vein, two logics, namely the direct inversion and physics-informed neural networks (PINNs), are explored. The direct inversion entails three steps: (i) spectral denoising and differentiation of the full-field data, (ii) building appropriate neural maps to approximate the profile of unknown physical and regularization parameters on their respective domains, and (iii) simultaneous training of the neural networks by minimizing the Tikhonov-regularized PDE loss using data from (i). PINNs furnish efficient surrogate models of complex systems with predictive capabilities via multitask learning where the field variables are modeled by neural maps endowed with (scaler or distributed) auxiliary parameters such as physical unknowns and loss function weights. PINNs are then trained by minimizing a measure of data misfit subject to the underlying physical laws as constraints. In this study, to facilitate learning from ultrasonic data, the PINNs loss adopts (a) wavenumber-dependent Sobolev norms to compute the data misfit, and (b) non-adaptive weights in a specific scaling framework to naturally balance the loss objectives by leveraging the form of PDEs germane to elastic-wave propagation. Both paradigms are examined via synthetic and laboratory test data. In the latter case, the reconstructions are performed at multiple frequencies and the results are verified by a set of complementary experiments highlighting the importance of verification and validation in data-driven modeling.
    Integrating Transformer and Autoencoder Techniques with Spectral Graph Algorithms for the Prediction of Scarcely Labeled Molecular Data. (arXiv:2211.06759v2 [cs.LG] UPDATED)
    In molecular and biological sciences, experiments are expensive, time-consuming, and often subject to ethical constraints. Consequently, one often faces the challenging task of predicting desirable properties from small data sets or scarcely-labeled data sets. Although transfer learning can be advantageous, it requires the existence of a related large data set. This work introduces three graph-based models incorporating Merriman-Bence-Osher (MBO) techniques to tackle this challenge. Specifically, graph-based modifications of the MBO scheme are integrated with state-of-the-art techniques, including a home-made transformer and an autoencoder, in order to deal with scarcely-labeled data sets. In addition, a consensus technique is detailed. The proposed models are validated using five benchmark data sets. We also provide a thorough comparison to other competing methods, such as support vector machines, random forests, and gradient boosting decision trees, which are known for their good performance on small data sets. The performances of various methods are analyzed using residue-similarity (R-S) scores and R-S indices. Extensive computational experiments and theoretical analysis show that the new models perform very well even when as little as 1% of the data set is used as labeled data.
    Start Small: Training Controllable Game Level Generators without Training Data by Learning at Multiple Sizes. (arXiv:2209.15052v2 [cs.LG] UPDATED)
    A level generator is a tool that generates game levels from noise. Training a generator without a dataset suffers from feedback sparsity, since it is unlikely to generate a playable level via random exploration. A common solution is shaped rewards, which guides the generator to achieve subgoals towards level playability, but they consume effort to design and require game-specific domain knowledge. This paper proposes a novel approach to train generators without datasets or shaped rewards by learning at multiple level sizes starting from small sizes and up to the desired sizes. The denser feedback at small sizes negates the need for shaped rewards. Additionally, the generators learn to build levels at various sizes, including sizes they were not trained for. We apply our approach to train recurrent auto-regressive generative flow networks (GFlowNets) for controllable level generation. We also adapt diversity sampling to be compatible with GFlowNets. The results show that our generators create diverse playable levels at various sizes for Sokoban, Zelda, and Danger Dave. When compared with controllable reinforcement learning level generators for Sokoban, the results show that our generators achieve better controllability and competitive diversity, while being 9x faster at training and level generation.
    Sequentially Controlled Text Generation. (arXiv:2301.02299v1 [cs.CL])
    While GPT-2 generates sentences that are remarkably human-like, longer documents can ramble and do not follow human-like writing structure. We study the problem of imposing structure on long-range text. We propose a novel controlled text generation task, sequentially controlled text generation, and identify a dataset, NewsDiscourse as a starting point for this task. We develop a sequential controlled text generation pipeline with generation and editing. We test different degrees of structural awareness and show that, in general, more structural awareness results in higher control-accuracy, grammaticality, coherency and topicality, approaching human-level writing performance.
    Restarts subject to approximate sharpness: A parameter-free and optimal scheme for first-order methods. (arXiv:2301.02268v1 [math.OC])
    Sharpness is an almost generic assumption in continuous optimization that bounds the distance from minima by objective function suboptimality. It leads to the acceleration of first-order methods via restarts. However, sharpness involves problem-specific constants that are typically unknown, and previous restart schemes reduce convergence rates. Moreover, such schemes are challenging to apply in the presence of noise or approximate model classes (e.g., in compressive imaging or learning problems), and typically assume that the first-order method used produces feasible iterates. We consider the assumption of approximate sharpness, a generalization of sharpness that incorporates an unknown constant perturbation to the objective function error. This constant offers greater robustness (e.g., with respect to noise or relaxation of model classes) for finding approximate minimizers. By employing a new type of search over the unknown constants, we design a restart scheme that applies to general first-order methods and does not require the first-order method to produce feasible iterates. Our scheme maintains the same convergence rate as when assuming knowledge of the constants. The rates of convergence we obtain for various first-order methods either match the optimal rates or improve on previously established rates for a wide range of problems. We showcase our restart scheme on several examples and point to future applications and developments of our framework and theory.
    Training trajectories, mini-batch losses and the curious role of the learning rate. (arXiv:2301.02312v1 [cs.LG])
    Stochastic gradient descent plays a fundamental role in nearly all applications of deep learning. However its efficiency and remarkable ability to converge to global minimum remains shrouded in mystery. The loss function defined on a large network with large amount of data is known to be non-convex. However, relatively little has been explored about the behavior of loss function on individual batches. Remarkably, we show that for ResNet the loss for any fixed mini-batch when measured along side SGD trajectory appears to be accurately modeled by a quadratic function. In particular, a very low loss value can be reached in just one step of gradient descent with large enough learning rate. We propose a simple model and a geometric interpretation that allows to analyze the relationship between the gradients of stochastic mini-batches and the full batch and how the learning rate affects the relationship between improvement on individual and full batch. Our analysis allows us to discover the equivalency between iterate aggregates and specific learning rate schedules. In particular, for Exponential Moving Average (EMA) and Stochastic Weight Averaging we show that our proposed model matches the observed training trajectories on ImageNet. Our theoretical model predicts that an even simpler averaging technique, averaging just two points a few steps apart, also significantly improves accuracy compared to the baseline. We validated our findings on ImageNet and other datasets using ResNet architecture.
    Evaluating counterfactual explanations using Pearl's counterfactual method. (arXiv:2301.02499v1 [stat.ML])
    Counterfactual explanations (CEs) are methods for generating an alternative scenario that produces a different desirable outcome. For example, if a student is predicted to fail a course, then counterfactual explanations can provide the student with alternate ways so that they would be predicted to pass. The applications are many. However, CEs are currently generated from machine learning models that do not necessarily take into account the true causal structure in the data. By doing this, bias can be introduced into the CE quantities. I propose in this study to test the CEs using Judea Pearl's method of computing counterfactuals which has thus far, surprisingly, not been seen in the counterfactual explanation (CE) literature. I furthermore evaluate these CEs on three different causal structures to show how the true underlying causal structure affects the CEs that are generated. This study presented a method of evaluating CEs using Pearl's method and it showed, (although using a limited sample size), that thirty percent of the CEs conflicted with those computed by Pearl's method. This shows that we cannot simply trust CEs and it is vital for us to know the true causal structure before we blindly compute counterfactuals using the original machine learning model.
    gRoMA: a Tool for Measuring Deep Neural Networks Global Robustness. (arXiv:2301.02288v1 [cs.LG])
    Deep neural networks (DNNs) are a state-of-the-art technology, capable of outstanding performance in many key tasks. However, it is challenging to integrate DNNs into safety-critical systems, such as those in the aerospace or automotive domains, due to the risk of adversarial inputs: slightly perturbed inputs that can cause the DNN to make grievous mistakes. Adversarial inputs have been shown to plague even modern DNNs; and so the risks they pose must be measured and mitigated to allow the safe deployment of DNNs in safety-critical systems. Here, we present a novel and scalable tool called gRoMA, which uses a statistical approach for formally measuring the global categorial robustness of a DNN - i.e., the probability of randomly encountering an adversarial input for a specific output category. Our tool operates on pre-trained, black-box classification DNNs. It randomly generates input samples that belong to an output category of interest, measures the DNN's susceptibility to adversarial inputs around these inputs, and then aggregates the results to infer the overall global robustness of the DNN up to some small bounded error. For evaluation purposes, we used gRoMA to measure the global robustness of the widespread Densenet DNN model over the CIFAR10 dataset and our results exposed significant gaps in the robustness of the different output categories. This experiment demonstrates the scalability of the new approach and showcases its potential for allowing DNNs to be deployed within critical systems of interest.
    Extreme Q-Learning: MaxEnt RL without Entropy. (arXiv:2301.02328v1 [cs.LG])
    Modern Deep Reinforcement Learning (RL) algorithms require estimates of the maximal Q-value, which are difficult to compute in continuous domains with an infinite number of possible actions. In this work, we introduce a new update rule for online and offline RL which directly models the maximal value using Extreme Value Theory (EVT), drawing inspiration from Economics. By doing so, we avoid computing Q-values using out-of-distribution actions which is often a substantial source of error. Our key insight is to introduce an objective that directly estimates the optimal soft-value functions (LogSumExp) in the maximum entropy RL setting without needing to sample from a policy. Using EVT, we derive our Extreme Q-Learning framework and consequently online and, for the first time, offline MaxEnt Q-learning algorithms, that do not explicitly require access to a policy or its entropy. Our method obtains consistently strong performance in the D4RL benchmark, outperforming prior works by 10+ points on some tasks while offering moderate improvements over SAC and TD3 on online DM Control tasks.
    Deep Latent Variable Models for Semi-supervised Paraphrase Generation. (arXiv:2301.02275v1 [cs.CL])
    This paper explores deep latent variable models for semi-supervised paraphrase generation, where the missing target pair is modelled as a latent paraphrase sequence. We present a novel unsupervised model named variational sequence auto-encoding reconstruction (VSAR), which performs latent sequence inference given an observed text. To leverage information from text pairs, we introduce a supervised model named dual directional learning (DDL). Combining VSAR with DDL (DDL+VSAR) enables us to conduct semi-supervised learning; however, the combined model suffers from a cold-start problem. To combat this issue, we propose to deal with better weight initialisation, leading to a two-stage training scheme named knowledge reinforced training. Our empirical evaluations suggest that the combined model yields competitive performance against the state-of-the-art supervised baselines on complete data. Furthermore, in scenarios where only a fraction of the labelled pairs are available, our combined model consistently outperforms the strong supervised model baseline (DDL and Transformer) by a significant margin.
    MSCDA: Multi-level Semantic-guided Contrast Improves Unsupervised Domain Adaptation for Breast MRI Segmentation in Small Datasets. (arXiv:2301.02554v1 [q-bio.QM])
    Deep learning (DL) applied to breast tissue segmentation in magnetic resonance imaging (MRI) has received increased attention in the last decade, however, the domain shift which arises from different vendors, acquisition protocols, and biological heterogeneity, remains an important but challenging obstacle on the path towards clinical implementation. Recently, unsupervised domain adaptation (UDA) methods have attempted to mitigate this problem by incorporating self-training with contrastive learning. To better exploit the underlying semantic information of the image at different levels, we propose a Multi-level Semantic-guided Contrastive Domain Adaptation (MSCDA) framework to align the feature representation between domains. In particular, we extend the contrastive loss by incorporating pixel-to-pixel, pixel-to-centroid, and centroid-to-centroid contrasts to integrate semantic information of images. We utilize a category-wise cross-domain sampling strategy to sample anchors from target images and build a hybrid memory bank to store samples from source images. Two breast MRI datasets were retrospectively collected: The source dataset contains non-contrast MRI examinations from 11 healthy volunteers and the target dataset contains contrast-enhanced MRI examinations of 134 invasive breast cancer patients. We set up experiments from source T2W image to target dynamic contrast-enhanced (DCE)-T1W image (T2W-to-T1W) and from source T1W image to target T2W image (T1W-to-T2W). The proposed method achieved Dice similarity coefficient (DSC) of 89.2\% and 84.0\% in T2W-to-T1W and T1W-to-T2W, respectively, outperforming state-of-the-art methods. Notably, good performance is still achieved with a smaller source dataset, proving that our framework is label-efficient.
    Conformal Loss-Controlling Prediction. (arXiv:2301.02424v1 [cs.LG])
    Conformal prediction is a learning framework controlling prediction coverage of prediction sets, which can be built on any learning algorithm for point prediction. This work proposes a learning framework named conformal loss-controlling prediction, which extends conformal prediction to the situation where the value of a loss function needs to be controlled. Different from existing works about risk-controlling prediction sets and conformal risk control with the purpose of controlling the expected values of loss functions, the proposed approach in this paper focuses on the loss for any test object, which is an extension of conformal prediction from miscoverage loss to some general loss. The controlling guarantee is proved under the assumption of exchangeability of data in finite-sample cases and the framework is tested empirically for classification with a class-varying loss and statistical postprocessing of numerical weather forecasting applications, which are introduced as point-wise classification and point-wise regression problems. All theoretical analysis and experimental results confirm the effectiveness of our loss-controlling approach.
    Combined mechanistic and machine learning method for construction of oil reservoir permeability map consistent with well test measurements. (arXiv:2301.02585v1 [physics.geo-ph])
    We propose a new method for construction of the absolute permeability map consistent with the interpreted results of well logging and well test measurements in oil reservoirs. Nadaraya-Watson kernel regression is used to approximate two-dimensional spatial distribution of the rock permeability. Parameters of the kernel regression are tuned by solving the optimization problem in which, for each well placed in an oil reservoir, we minimize the difference between the actual and predicted values of (i) absolute permeability at the well location (from well logging); (ii) absolute integral permeability of the domain around the well and (iii) skin factor (from well tests). Inverse problem is solved via multiple solutions to forward problems, in which we estimate the integral permeability of reservoir surrounding a well and the skin factor by the surrogate model. The last one is developed using an artificial neural network trained on the physics-based synthetic dataset generated using the procedure comprising the numerical simulation of bottomhole pressure decline curve in reservoir simulator followed by its interpretation using a semi-analytical reservoir model. The developed method for reservoir permeability map construction is applied to the available reservoir model (Egg Model) with highly heterogeneous permeability distribution due to the presence of highly-permeable channels. We showed that the constructed permeability map is hydrodynamically similar to the original one. Numerical simulations of production in the reservoir with constructed and original permeability maps are quantitatively similar in terms of the pore pressure and fluid saturations distribution at the end of the simulation period. Moreover, we obtained an good match between the obtained results of numerical simulations in terms of the flow rates and total volumes of produced oil, water and injected water.
    Lower Complexity Bounds of Finite-Sum Optimization Problems: The Results and Construction. (arXiv:2103.08280v5 [math.OC] UPDATED)
    In this paper, we study the lower complexity bounds for finite-sum optimization problems, where the objective is the average of $n$ individual component functions. We consider Proximal Incremental First-order (PIFO) algorithms which have access to the gradient and proximal oracles for each component function. To incorporate loopless methods, we also allow PIFO algorithms to obtain the full gradient infrequently. We develop a novel approach to constructing the hard instances, which partitions the tridiagonal matrix of classical examples into $n$ groups. This construction is friendly to the analysis of PIFO algorithms. Based on this construction, we establish the lower complexity bounds for finite-sum minimax optimization problems when the objective is convex-concave or nonconvex-strongly-concave and the class of component functions is $L$-average smooth. Most of these bounds are nearly matched by existing upper bounds up to log factors. We can also derive similar lower bounds for finite-sum minimization problems as previous work under both smoothness and average smoothness assumptions. Our lower bounds imply that proximal oracles for smooth functions are not much more powerful than gradient oracles.
    Reversibility of elliptical slice sampling revisited. (arXiv:2301.02426v1 [math.ST])
    We discuss the well-definedness of elliptical slice sampling, a Markov chain approach for approximate sampling of posterior distributions introduced by Murray, Adams and MacKay 2010. We point to a regularity requirement and provide an alternative proof of the reversibility property. In particular, this guarantees the correctness of the slice sampling scheme also on infinite-dimensional separable Hilbert spaces.
    Graph Contrastive Learning for Multi-omics Data. (arXiv:2301.02242v1 [q-bio.GN])
    Advancements in technologies related to working with omics data require novel computation methods to fully leverage information and help develop a better understanding of human diseases. This paper studies the effects of introducing graph contrastive learning to help leverage graph structure and information to produce better representations for downstream classification tasks for multi-omics datasets. We present a learnining framework named Multi-Omics Graph Contrastive Learner(MOGCL) which outperforms several aproaches for integrating multi-omics data for supervised learning tasks. We show that pre-training graph models with a contrastive methodology along with fine-tuning it in a supervised manner is an efficient strategy for multi-omics data classification.  ( 2 min )
    A Data-Driven Gaussian Process Filter for Electrocardiogram Denoising. (arXiv:2301.02607v1 [eess.SP])
    Objective: Gaussian Processes (GP)-based filters, which have been effectively used for various applications including electrocardiogram (ECG) filtering can be computationally demanding and the choice of their hyperparameters is typically ad hoc. Methods: We develop a data-driven GP filter to address both issues, using the notion of the ECG phase domain -- a time-warped representation of the ECG beats onto a fixed number of samples and aligned R-peaks, which is assumed to follow a Gaussian distribution. Under this assumption, the computation of the sample mean and covariance matrix is simplified, enabling an efficient implementation of the GP filter in a data-driven manner, with no ad hoc hyperparameters. The proposed filter is evaluated and compared with a state-of-the-art wavelet-based filter, on the PhysioNet QT Database. The performance is evaluated by measuring the signal-to-noise ratio (SNR) improvement of the filter at SNR levels ranging from -5 to 30dB, in 5dB steps, using additive noise. For a clinical evaluation, the error between the estimated QT-intervals of the original and filtered signals is measured and compared with the benchmark filter. Results: It is shown that the proposed GP filter outperforms the benchmark filter for all the tested noise levels. It also outperforms the state-of-the-art filter in terms of QT-interval estimation error bias and variance. Conclusion: The proposed GP filter is a versatile technique for preprocessing the ECG in clinical and research applications, is applicable to ECG of arbitrary lengths and sampling frequencies, and provides confidence intervals for its performance.  ( 2 min )
    Deep Biological Pathway Informed Pathology-Genomic Multimodal Survival Prediction. (arXiv:2301.02383v1 [q-bio.QM])
    The integration of multi-modal data, such as pathological images and genomic data, is essential for understanding cancer heterogeneity and complexity for personalized treatments, as well as for enhancing survival predictions. Despite the progress made in integrating pathology and genomic data, most existing methods cannot mine the complex inter-modality relations thoroughly. Additionally, identifying explainable features from these models that govern preclinical discovery and clinical prediction is crucial for cancer diagnosis, prognosis, and therapeutic response studies. We propose PONET- a novel biological pathway-informed pathology-genomic deep model that integrates pathological images and genomic data not only to improve survival prediction but also to identify genes and pathways that cause different survival rates in patients. Empirical results on six of The Cancer Genome Atlas (TCGA) datasets show that our proposed method achieves superior predictive performance and reveals meaningful biological interpretations. The proposed method establishes insight into how to train biologically informed deep networks on multimodal biomedical data which will have general applicability for understanding diseases and predicting response and resistance to treatment.  ( 2 min )
    TWR-MCAE: A Data Augmentation Method for Through-the-Wall Radar Human Motion Recognition. (arXiv:2301.02488v1 [eess.SP])
    To solve the problems of reduced accuracy and prolonging convergence time of through-the-wall radar (TWR) human motion due to wall attenuation, multipath effect, and system interference, we propose a multilink auto-encoding neural network (TWR-MCAE) data augmentation method. Specifically, the TWR-MCAE algorithm is jointly constructed by a singular value decomposition (SVD)-based data preprocessing module, an improved coordinate attention module, a compressed sensing learnable iterative shrinkage threshold reconstruction algorithm (LISTA) module, and an adaptive weight module. The data preprocessing module achieves wall clutter, human motion features, and noise subspaces separation. The improved coordinate attention module achieves clutter and noise suppression. The LISTA module achieves human motion feature enhancement. The adaptive weight module learns the weights and fuses the three subspaces. The TWR-MCAE can suppress the low-rank characteristics of wall clutter and enhance the sparsity characteristics in human motion at the same time. It can be linked before the classification step to improve the feature extraction capability without adding other prior knowledge or recollecting more data. Experiments show that the proposed algorithm gets a better peak signal-to-noise ratio (PSNR), which increases the recognition accuracy and speeds up the training process of the back-end classifiers.  ( 2 min )
    DANLIP: Deep Autoregressive Networks for Locally Interpretable Probabilistic Forecasting. (arXiv:2301.02332v1 [cs.LG])
    Despite the high performance of neural network-based time series forecasting methods, the inherent challenge in explaining their predictions has limited their applicability in certain application areas. Due to the difficulty in identifying causal relationships between the input and output of such black-box methods, they rarely have been adopted in domains such as legal and medical fields in which the reliability and interpretability of the results can be essential. In this paper, we propose \model, a novel deep learning-based probabilistic time series forecasting architecture that is intrinsically interpretable. We conduct experiments with multiple datasets and performance metrics and empirically show that our model is not only interpretable but also provides comparable performance to state-of-the-art probabilistic time series forecasting methods. Furthermore, we demonstrate that interpreting the parameters of the stochastic processes of interest can provide useful insights into several application areas.
    Singing voice synthesis based on frame-level sequence-to-sequence models considering vocal timing deviation. (arXiv:2301.02262v1 [eess.AS])
    This paper proposes singing voice synthesis (SVS) based on frame-level sequence-to-sequence models considering vocal timing deviation. In SVS, it is essential to synchronize the timing of singing with temporal structures represented by scores, taking into account that there are differences between actual vocal timing and note start timing. In many SVS systems including our previous work, phoneme-level score features are converted into frame-level ones on the basis of phoneme boundaries obtained by external aligners to take into account vocal timing deviations. Therefore, the sound quality is affected by the aligner accuracy in this system. To alleviate this problem, we introduce an attention mechanism with frame-level features. In the proposed system, the attention mechanism absorbs alignment errors in phoneme boundaries. Additionally, we evaluate the system with pseudo-phoneme-boundaries defined by heuristic rules based on musical scores when there is no aligner. The experimental results show the effectiveness of the proposed system.  ( 2 min )
    GNN-based Passenger Request Prediction. (arXiv:2301.02515v1 [cs.LG])
    Passenger request prediction is essential for operations planning, control, and management in ride-sharing platforms. While the demand prediction problem has been studied extensively, the Origin-Destination (OD) flow prediction of passengers has received less attention from the research community. This paper develops a Graph Neural Network framework along with the Attention Mechanism to predict the OD flow of passengers. The proposed framework exploits various linear and non-linear dependencies that arise among requests originating from different locations and captures the repetition pattern and the contextual data of that place. Moreover, the optimal size of the grid cell that covers the road network and preserves the complexity and accuracy of the model is determined. Extensive simulations are conducted to examine the characteristics of our proposed approach and its various components. The results show the superior performance of our proposed model compared to the existing baselines.  ( 2 min )
    TrojanPuzzle: Covertly Poisoning Code-Suggestion Models. (arXiv:2301.02344v1 [cs.CR])
    With tools like GitHub Copilot, automatic code suggestion is no longer a dream in software engineering. These tools, based on large language models, are typically trained on massive corpora of code mined from unvetted public sources. As a result, these models are susceptible to data poisoning attacks where an adversary manipulates the model's training or fine-tuning phases by injecting malicious data. Poisoning attacks could be designed to influence the model's suggestions at run time for chosen contexts, such as inducing the model into suggesting insecure code payloads. To achieve this, prior poisoning attacks explicitly inject the insecure code payload into the training data, making the poisoning data detectable by static analysis tools that can remove such malicious data from the training set. In this work, we demonstrate two novel data poisoning attacks, COVERT and TROJANPUZZLE, that can bypass static analysis by planting malicious poisoning data in out-of-context regions such as docstrings. Our most novel attack, TROJANPUZZLE, goes one step further in generating less suspicious poisoning data by never including certain (suspicious) parts of the payload in the poisoned data, while still inducing a model that suggests the entire payload when completing code (i.e., outside docstrings). This makes TROJANPUZZLE robust against signature-based dataset-cleansing methods that identify and filter out suspicious sequences from the training data. Our evaluation against two model sizes demonstrates that both COVERT and TROJANPUZZLE have significant implications for how practitioners should select code used to train or tune code-suggestion models.  ( 2 min )
    Valid P-Value for Deep Learning-Driven Salient Region. (arXiv:2301.02437v1 [stat.ML])
    Various saliency map methods have been proposed to interpret and explain predictions of deep learning models. Saliency maps allow us to interpret which parts of the input signals have a strong influence on the prediction results. However, since a saliency map is obtained by complex computations in deep learning models, it is often difficult to know how reliable the saliency map itself is. In this study, we propose a method to quantify the reliability of a salient region in the form of p-values. Our idea is to consider a salient region as a selected hypothesis by the trained deep learning model and employ the selective inference framework. The proposed method can provably control the probability of false positive detections of salient regions. We demonstrate the validity of the proposed method through numerical examples in synthetic and real datasets. Furthermore, we develop a Keras-based framework for conducting the proposed selective inference for a wide class of CNNs without additional implementation cost.  ( 2 min )
    Multi-Genre Music Transformer -- Composing Full Length Musical Piece. (arXiv:2301.02385v1 [cs.SD])
    In the task of generating music, the art factor plays a big role and is a great challenge for AI. Previous work involving adversarial training to produce new music pieces and modeling the compatibility of variety in music (beats, tempo, musical stems) demonstrated great examples of learning this task. Though this was limited to generating mashups or learning features from tempo and key distributions to produce similar patterns. Compound Word Transformer was able to represent music generation task as a sequence generation challenge involving musical events defined by compound words. These musical events give a more accurate description of notes progression, chord change, harmony and the art factor. The objective of the project is to implement a Multi-Genre Transformer which learns to produce music pieces through more adaptive learning process involving more challenging task where genres or form of the composition is also considered. We built a multi-genre compound word dataset, implemented a linear transformer which was trained on this dataset. We call this Multi-Genre Transformer, which was able to generate full length new musical pieces which is diverse and comparable to original tracks. The model trains 2-5 times faster than other models discussed.  ( 2 min )
    Myths and Legends in High-Performance Computing. (arXiv:2301.02432v1 [cs.DC])
    In this humorous and thought provoking article, we discuss certain myths and legends that are folklore among members of the high-performance computing community. We collected those myths from conversations at conferences and meetings, product advertisements, papers, and other communications such as tweets, blogs, and news articles within (and beyond) our community. We believe they represent the zeitgeist of the current era of massive change, driven by the end of many scaling laws such as Dennard scaling and Moore's law. While some laws end, new directions open up, such as algorithmic scaling or novel architecture research. However, these myths are rarely based on scientific facts but often on some evidence or argumentation. In fact, we believe that this is the very reason for the existence of many myths and why they cannot be answered clearly. While it feels like there should be clear answers for each, some may remain endless philosophical debates such as the question whether Beethoven was better than Mozart. We would like to see our collection of myths as a discussion of possible new directions for research and industry investment.  ( 2 min )

  • Open

    Leveraging AI To Build Apps
    submitted by /u/emanresu_2017 [link] [comments]  ( 46 min )
    I programmed my phone to do standup comedy
    submitted by /u/iusereditt [link] [comments]  ( 46 min )
    AI Dream 147 - Unbelievable MINDBLOW AI Video - MASTERPIECE 30min
    submitted by /u/LordPewPew777 [link] [comments]  ( 46 min )
    Is there a free AI writter that doesn't have a word limit
    Open AI playground is fine in all but it has its limits. It can only write so much. Anything stronger then Open AI but free to use? submitted by /u/Zan_korida [link] [comments]  ( 46 min )
    I built ColabRating, a site where you can show off your Google Colab.
    Colabs are how I got into AI, and I think they're a great place to start - looking at other people's, and building your own. I couldn't find any rating sites for Colabs, so I made one. Any suggestions as to how to make it better would be appreciated. I'll add categories as and when there are enough colabs on there to need them. https://colabrating.com/ submitted by /u/andysurtees [link] [comments]  ( 47 min )
    What is involved in training a language model like ChatGPT?
    ChatGPT is ok, but it isn't trained on specific niches (Greek history). You could train it on this data though. What all is involved in training a model like this to be able to talk to you as does ChatGPT but using data you give it? Is there an existing program you can use to do this? submitted by /u/eratonnn [link] [comments]  ( 49 min )
    Intelligent Document Processing System
    Hi! I am trying to learn about Intelligent Document Processing Need to build a automation tool to find some words in documents in pdf/word format Make a check list about what was found These documents are digitalizations from a scanner Some documents have 900 pages or more, and some have bad quality digitalizations from decades ago (probably need to setup a database for each word) I know there's several which can do that job, but I am looking for something more accessible, these available are too expensive targeting enterprises Any guidance would be very helpful! submitted by /u/ThereisNothingHeeree [link] [comments]  ( 51 min )
    Baidu Create 2022: AI Developer Conference
    Hey all r/artificial, I wanted to invite you to join us for Baidu Create, our annual AI developer conference. We'll be exploring the latest developments in AI technology and innovation, and discussing how we can shape the future of AI together with a global community of creators. You can watch the conference live on Baidu YouTube: (https://www.youtube.com/watch?v=LlydjVDYb3A) at 10:00 pm PDT on January 9th. As a sneak peek, here are some of the tech innovations that will be unveiled at Baidu Create: * A band of virtual persons * Big models for generative AI * A metaverse built in 40 days * Voice interaction without echo * Connecting cars & roads with a shared perception * Generative search engine * Everyone can quantum * Scientific computing * Next-gen computing for the future cloud https://preview.redd.it/s7e0gdvoa2ba1.png?width=1200&format=png&auto=webp&s=b5e2f6377ae01b8e5189a814355a0b5f798d58d4 submitted by /u/trcytony [link] [comments]  ( 47 min )
    ChatGPT as a Cheating Tool
    submitted by /u/BackgroundResult [link] [comments]  ( 49 min )
    What is GPTZero, the ChatGPT Watermark Alternative?
    submitted by /u/BackgroundResult [link] [comments]  ( 48 min )
    Searching for a medical Q-A Dataset to categroze answers given by patients in response to an AI question
    Hello everyone, ​ i recently started a project, for which I want to categorize the answers given by a human patient to a question asked by an AI to a category. Example: AI: Are you currently smoking? Patient: No, but I smoked until last september. I t was quite a decent amount, but I stopped, and havnt touched a cigarette since. Detected Category: No ​ I have searched far and wide for a dataset containing medical consultations with data annotated in that way, but havent found any. ​ How would you think is the best way to adress something like this without having to start collecting data? Thank you submitted by /u/Fabianslife [link] [comments]  ( 50 min )
    Microsoft to integrate ChatGPT into Office products
    submitted by /u/Number_5_alive [link] [comments]  ( 46 min )
    Stable Diffusion PC INSTALLATION 2023 UPDATE! AI Art For BEGINNERS!
    submitted by /u/PuppetHere [link] [comments]  ( 49 min )
    5 Growing Libraries in Python for Causality Analysis
    submitted by /u/pasticciociccio [link] [comments]  ( 49 min )
    Google AI Introduces Muse: A Text-To-Image Generation/Editing Model via Masked Generative Transformers
    submitted by /u/ai-lover [link] [comments]  ( 47 min )
    Researchers at Stanford have developed an Artificial Intelligence (AI) Model, SUMMON, that can generate Multi-Object Scenes from a Sequence of Human Interaction
    submitted by /u/ai-lover [link] [comments]  ( 47 min )
    What happens to OpenAI if it reaches $29 billion in valuation
    submitted by /u/bendee983 [link] [comments]  ( 49 min )
    ChatGPT is just the beginning: How advanced AI is set to enter a new era
    submitted by /u/moviesdusk [link] [comments]  ( 49 min )
    I asked ChatGPT to cast countries as villains in a movie
    submitted by /u/EvilCorpGame [link] [comments]  ( 51 min )
    Neural Search vs. Google Search: What's the difference?
    I read an article about neural search and for those who don’t know, it’s a way for computers to find stuff using these special programs called neural networks. It can be used in lots of different ways, like searching the web, or helping you find things on your computer. It can also find things that are close to what we're looking for. It can even search through images, audio, and video. Sometimes it's even better to use a combination of Neural Search and other methods to get the best results. Sounds a lot like something Google Search would do? But from what I understand, Google uses "artificial neural networks" to try and understand what we are looking for and find the best websites for it. But I think Google also uses lots of other ways to help us find what we are looking for, so it's not just using the neural networks. Anyone know the difference? submitted by /u/gabuzgab [link] [comments]  ( 51 min )
    Summate.it - Quickly summarise web articles with OpenAI
    submitted by /u/fivefilters [link] [comments]  ( 51 min )
    Where to get started with AI?
    As I've been browsing Twitter and various imageboards/forums lately I've been seeing all the rage about AI recently and just feel so disconnected from it. Basically, where do I go to get started with AI and the software to do the things I've seen around the internet like, image generation or say writing an essay? Thanks. submitted by /u/Decryptionite [link] [comments]  ( 46 min )
  • Open

    [P] Built an at-cost, pay per second, open-source API for Tortoise text-to-speech (best I've heard!)
    Improve Tortoise TTS by 30% inference speed, and packaged it up as a hosted API that charges per-second. All code is open-sourced: https://github.com/metavoicexyz/tortoise-tts-modal-api, https://github.com/metavoicexyz/tortoise-tts It can be used via a UI on: https://tts.themetavoice.xyz There are more details here: https://twitter.com/vatsal\_aggarwal/status/1612536547248836608?s=20 submitted by /u/Apprehensive-Tax-214 [link] [comments]  ( 57 min )
    [D] Do cloud gpu's run while my laptop is switched off?
    This might sound like a dumb question but do cloud gpu's that you rent still train data when I switch off my laptop since the GPU is still running somewhere in the cloud? Or does it switch off when I close my laptop automatically? Also would anyone know any cheap GPU cloud websites for training a tensorflow mnt model on 12 million Reddit comments and replies. (Idea from sentdex's Reddit chatbot tutorial). Thanks In advance :) submitted by /u/smileawe3211 [link] [comments]  ( 61 min )
    [D] Maarten Grootendorst: BERTopic, Data Science, Psychology | Learning from ML Episode 1
    This is the first episode of a new podcast on machine learning featuring Maarten Grootendorst. Maarten Grootendorst: BERTopic, Data Science, Psychology | Learning from Machine Learning #1 submitted by /u/slam0077 [link] [comments]  ( 56 min )
    [D] Looking for github package testing many decision tree models - it exists but I can't find it in my browser history
    Hi everyone, A couple of months ago I saw a github with a package testing many different decision tree models on the same (user provided) dataset, really fast, in python ; the goal is to select the optimal one programmatically. I can't remember if I discovered that package on hacker news, or on github trending. No luck sifting through my browser history. Would any of you recognise this description and know the package ? Help appreciated ! submitted by /u/Maaaaxime [link] [comments]  ( 73 min )
    [D] Am I reducing the dimensionality of the problem by using a categorial feature but with high cardinality?
    In practice, I am working with chemical formulations with thousands of ingredients. Using each ingredient (like a one-hot-encoding) would explode the dimensionality of the problem. I am thinking about grouping all these ingredients into their "functional role" (20 or so) . If so, I could greately reduce the number of features, but the cardinality would be high for each feature. Did the dimensionality really go down from a thousand to 20? My intution tells me that 20 should be multiplied by all the cardinalities of each feature, and such, I haven't made much progress in reducing dimensionality. Does anyone have any insight or experience with these high dimensional/high cardinality problems and what is the best way to do feature engineering? submitted by /u/DreamyPen [link] [comments]  ( 59 min )
    [R] Diffusion language models
    Hi /r/ML, I wrote down my thoughts about what it might take for diffusion to displace autoregression in the field of language modelling (as it has in perceptual domains, like image/audio/video generation). Let me know what you think! https://benanne.github.io/2023/01/09/diffusion-language.html submitted by /u/benanne [link] [comments]  ( 57 min )
    What is a "justified classification"? [R][P]
    And how to make a justified classification, for example when dealing with a plethora of content/items split between two buckets? My initial understanding is to provide a rationale, but is there a specific format for doing "justified classification"? How to present rationale? What is needed for rationale, peer-review sources? https://proceedings.mlr.press/v89/cohen19a.html https://arxiv.org/abs/1702.05659 submitted by /u/pmdev1234 [link] [comments]  ( 57 min )
    [D] Understanding the discrete behavior of Neural Nets
    We all know that Deep learning models are extremely susceptible to noise and can be easily fooled by adding a small amount of noise. These noises can be calculated by methods like the Fast Gradient method and are almost imperceptible to human eyes. But there is a way to somewhat mitigate the adversarial attacks and force neural Networks to behave in a more continuous fashion and it's called Lipschitz regularization. It is a method for enforcing a certain level of smoothness on the output of a machine-learning model. It can improve the model’s generalization performance and help prevent overfitting. It is particularly useful for deep learning models, which are prone to overfitting due to their large number of parameters. Link to the full article: https://medium.com/p/fdeafb2d5c14 ​ https://preview.redd.it/mlfctr36uzaa1.png?width=828&format=png&auto=webp&s=b503cf0ae8dcbe5aba9e7fdd1258ef09060d1f29 submitted by /u/Difficult-Race-1188 [link] [comments]  ( 63 min )
    Best language model for filling multiple related masks [D]
    I would like to fill sentences where I know the first and last word. I've been experimenting with BERT and using [mask] [mask] etc, but the returned values don't seem to form a coherent sentence. Is there a better model to use please? submitted by /u/shacrawford [link] [comments]  ( 60 min )
    [D] I want to use GPT-J-6B for my story-writing project but I have a few questions about it.
    - Cost, Effort, and Performance-wise, does it make more sense to instead just pay to use the OpenAI API and use a cheaper GPT-3 model to lessen business costs? My biggest concern is having my entire business reliant on a 3rd-party API, even more so than the costs of using the model. - How good is it at writing short stories? If there are better open-source alternatives for doing this better or at a similar level but less resource expensive, what are they? - How resource-expensive is it to use locally? These are my laptop capabilities:16.0 GB of RAM, AMD Ryzen 7 5800H with Radeon Graphics 3.20 GHz. - How would I approach fine-tuning it? Are there any resources going through the step-by-step process? Currently, in my mind, I just need to shove a large free-to-use data-set like stories and wait like a day but I have no expertise in this area. - If I want to incorporate it into a website with an API that takes prompts from users, are there any costs that I should account for? Is there a way to minimize these costs? For example, is there a specific API set-up or one-time cost like an expensive laptop to host it locally and take prompts that I could be implementing? - Are there any concerns I should have when scaling it for users, such as costs and slow response rate? Also, is there a cap in terms of the requests it can handle or is that just limited by what my own machine can handle? submitted by /u/learningmoreandmore [link] [comments]  ( 62 min )
    [N] What's next for AI?
    What's next for AI | MIT Technology Review submitted by /u/vsmolyakov [link] [comments]  ( 60 min )
  • Open

    Choosing Microcontroller For Neural Net
    I work in hardware and have been given a trained and working neural net (TF Lite file) with the goal of picking a micro controller to run it in real time. I'm unsure of the best way to evaluate microcontrollers for performance/cost, or what the key metrics to use when evaluating this file or possible microcontrollers. If there was a tool to benchmark this file and cross reference micro controller performance that would be ideal but I don't believe anything like this exists. If you have a neural net, what parameters do you use to decide what micro to use? I could just pick the highest performing chip but want to save money and don't want to spend a lot of time getting it to work in one architecture only to change it later on. submitted by /u/freebird4446 [link] [comments]  ( 48 min )
  • Open

    5 Reasons Why Pandas is the Best Library for Data Science in Python
    Introduction: Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 7 min )
  • Open

    Economics of Ethics: Is Ethics Ultimately an Economics Conversation? Part III
    This is part 3 of a three-part series on the Economics of Ethics. In Part I of the Economics of Ethic series, we talked about economics as a framework for the creation and distribution of society value. In Part II, we discussed the difference between financial and economic measures, the role of laws and regulations… Read More »Economics of Ethics: Is Ethics Ultimately an Economics Conversation? Part III The post Economics of Ethics: Is Ethics Ultimately an Economics Conversation? Part III appeared first on Data Science Central.  ( 23 min )
    Digital Twin Technology – Top Use Cases in Smart Healthcare
    A digital twin in healthcare is a virtual representation of human physiology, hospital results, lab environment, etc. It is revolutionizing hospital management and clinical healthcare by enabling researchers to study various diseases, medical devices, and drugs. Digital twin can be used to study the genome of a person, its physiological characteristics, and the overall lifestyle.… Read More »Digital Twin Technology – Top Use Cases in Smart Healthcare The post Digital Twin Technology – Top Use Cases in Smart Healthcare appeared first on Data Science Central.  ( 20 min )
    How Do Cyber Criminals Obtain Sensitive Information?
    On the internet, information is worth its weight in gold. And malicious hackers know it. Nowadays, companies collect and hold large volumes of user data. Much of it refers to sensitive information that used to be kept by financial and medical institutions only. For example, threat actors can obtain data by compromising versatile eCommerce websites… Read More »How Do Cyber Criminals Obtain Sensitive Information? The post How Do Cyber Criminals Obtain Sensitive Information? appeared first on Data Science Central.  ( 21 min )
    Data science and the death of (all but narrow) AI expertise in 2023
    Back before he retired, Naval War College professor and contributor to The Atlantic Tom Nichols published a 2017 book called The Death of Expertise. Those who claim their own facts or knowledge without supporting evidence, he noted, have become more and more prominent in online conversation we’ve been having. And the noisiest and most prone… Read More »Data science and the death of (all but narrow) AI expertise in 2023 The post Data science and the death of (all but narrow) AI expertise in 2023 appeared first on Data Science Central.  ( 21 min )
  • Open

    Euler line
    The previous post discussed the circumcenter and orthocenter of a triangle. Euler proved that the centroid, circumcenter, and orthocenter all fall on a common line, now called the Euler line. The centroid is the center of mass of a triangle. If you draw lines from each vertex to the midpoint of the opposite side, the […] Euler line first appeared on John D. Cook.  ( 5 min )
    Relating circumcenter and orthocenter
    The previous post mentioned that the law of sines gives you the diameter of a circle through the vertices of a triangle. How would you find the center of this circle, the blue dot in the image above? If the angles of the triangle are α. β, and γ, then the trilinear coordinates of the […] Relating circumcenter and orthocenter first appeared on John D. Cook.  ( 5 min )
    Computing inscribed radius and circumscribed radius
    A few days ago I wrote about the law of cotangents. This law says that if we label the sides of a triangle a, b, c and label the angles opposite each side α. β, γ, then where s is the semi-parameter, i.e. and r is the radius of the incircle, the largest circle that […] Computing inscribed radius and circumscribed radius first appeared on John D. Cook.  ( 4 min )
  • Open

    Model hosting patterns in Amazon SageMaker, Part 1: Common design patterns for building ML applications on Amazon SageMaker
    Machine learning (ML) applications are complex to deploy and often require the ability to hyper-scale, and have ultra-low latency requirements and stringent cost budgets. Use cases such as fraud detection, product recommendations, and traffic prediction are examples where milliseconds matter and are critical for business success. Strict service level agreements (SLAs) need to be met, […]  ( 33 min )
  • Open

    What is the effect of observation space bounds?
    When constructing an observation space, what effect do the bounds have? For example, if all my observations are between 1 and 0, what difference will there be if I define the high as 1 and the low as 0 as opposed to setting the high and low to be infinity? Similarly, if all my observations are between 1 and 0 except one which is between 2 and 0, will I lose anything by simply defining them all as having a high of 2 and a low of 0? submitted by /u/centripetalstranger [link] [comments]  ( 58 min )
    Actor-Critic restarts
    Hi, I'm quite new in machine learning and I followed the official tensorflow tutorial at https://www.tensorflow.org/tutorials/reinforcement_learning/actor_critic?fbclid=IwAR2cZLNFPtoW6vBRUPMTxvvJSLpI3JkhNd-4qNlA3alwQyYtQo-FZXTkN-k The neural network works, but I have a question. I plotted the graph of rewards (https://ibb.co/9V973tB) and i noticed that network was learning in the beginning pretty well, it reached the maximum possible award and than it looks like it restarted and started learning over again. Can you explain why is that please? Can I somehow prevent that reset? Thank you and sorry for dumb questions. submitted by /u/Enroot [link] [comments]  ( 60 min )
    Desktop recommendations for DRL
    Hi there, Disclaimer: perhaps this isn’t the place to post this question. If so, I apologize, and please let me know where would be a good place to post this question instead. I am starting out in the exciting field of DRL and I am looking to buy a desktop/workstation for the purpose. My budget is 3000 euros / 3200 dollars. Any tips / recommendations on what would be the best option for me? Thanks in advance! submitted by /u/acorntje [link] [comments]  ( 60 min )
    Need help in selecting research project
    I am an undergrad student, and I am assigned to add some novelty to some recent research papers of my choice. I have chosen reinforcement learning as the theme for this project. Could you guys please help me to decide papers to work upon? I got 4 months to complete this task, i will be having normal course work as well, that means I will not be able to spend more than an hour or two per day. My professor suggested me to work upon making lightweight DQN that could run on mobile phones. submitted by /u/travardg [link] [comments]  ( 57 min )

  • Open

    [R] Learning Learning-Rates: SteDy Optimizer
    I've written a small piece of research on an idea of mine, a new optimizer which has an adaptive global learning-rate, based off Adam and uses (what I think) is a neat trick to get the calculus to work. My goal in putting it here is mainly to ask for opinions and directions; to clarify, I've not received any professional/formal education in Machine Learning, and my studies in it are purely my own, and I'm not connected to any circles which could help me. What I've done is taken some simple concepts and mimicked what I've seen in papers I've read. I think (hope) that I'm solid in the math and code and concepts of AI, but clueless about the real-world stuff around it. This is me asking about what that other stuff, first steps into this field publicy, is like. Any advice would be much appreciated. Many thanks. A PDF is availabe here. submitted by /u/LahmacunBear [link] [comments]  ( 58 min )
    [D]Where to look to refresh and acquire new skills?
    Hi, I completed a ML PhD in 2015. I’ve done a number of projects with CNN architectures and recently I have been working as a consultant for computer vision and data scientist. As I haven’t been involved in research for some time now, I am looking for courses and other resources to refresh and update my knowledge. Could anyone suggest where to start? Right now, I am applying segmentation on drone imagery (RGB and multispectral). I have used DeepLabV3+. One challenge that I have is the annotation. For example, annotating wheat and weed on drone images taken at the altitude 300 m is hard. One thing that I would like to research, therefore, is auto-annotation and possibly self-supervised learning. submitted by /u/ThickDoctor007 [link] [comments]  ( 59 min )
    [P] I built Adrenaline, a debugger that fixes errors and explains them with GPT-3
    submitted by /u/jsonathan [link] [comments]  ( 58 min )
    [Discussion] Improving Problem Solving Skills of LLMs With Self-Directed Planning
    I've been doing some personal experiments with ChatGPT to see what kinds of influence a prompt has on the results of problem solving tests. This is along the same lines as the following research from 2022 that I found after I started doing some tests: https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html The results were pretty remarkable. If you simply ask a question("True or False: 73 minutes after 2pm is the same time as 15 minutes before 4pm."), you get very simplistic and often wrong reasoning or just an answer with no reasoning which is also often wrong. I tested this on the above prompt and it was wrong on 5/5 tries. Then I tested the following prompt where I first instructed it to come up with a plan for solving the problem in question, then had it follow that plan. (" You are a brilliant professor specialized in general problem solving techniques. Give a lecture on the techniques to use to solve problems like the following true/false statement: True or False: 73 minutes after 2pm is the same time as 15 minutes before 4pm.") This resulted in it answering the question correctly on 5/5 tries and with proper reasoning as to why it got the answer it did. I did a more complete write up on this here: https://www.reddit.com/r/ChatGPT/comments/106kxyw/improving_ai_reasoning_skills_through/?utm_source=share&utm_medium=web2x&context=3 You can also find the actual model outputs in that link if you are curious as to its process. I hope you find this interesting and try it yourself! submitted by /u/oddlyspecificnumber7 [link] [comments]  ( 72 min )
    FastQL: Prototype your text to image models in GraphQL with Rust backend in one line of code [P]
    Hey everyone! I wanted to share a new Python package called FastQL that makes it easy to prototype and share machine learning models using GraphQL. It's really fast and efficient thanks to using rust to serve the API on a separate process. With FastQL, all you have to do is provide a callback function and a Python dictionary describing your GraphQL API, and FastQL will handle the rest. This makes it super easy to prototype ML models and get them up and running quickly. You can find FastQL on PyPI and GitHub. We've included simple steps and a Dockerfile to help you spin up your own Stable Diffusion or other Hugging Face models. There's even an example that lets you train a huggingface diffusers (Stable diffusion 2, runway) model on your own images, with instructions for spinning it up on AWS in minutes, even if you're new to ML and Python. We'd love to have your help and support, so if you're interested in getting involved, let us know! Thanks to Async-GraphQL, Hugging Face, Stable Diffusion, and all the other people and projects that inspired and helped us. ❤️ DJ Fresh, @chrisjbishop156 and friends. submitted by /u/djfreshuk [link] [comments]  ( 63 min )
    [D] Have you ever used Knowledge Distillation in practice?
    There's been a ton of academic work exploring knowledge distillation techniques, sparsity in networks and many others, often with vast numbers of citations. I was wondering what the status of those in real-world ML was. Has any of you used it in a concrete situation? What did you find to work best for you? submitted by /u/fredlafrite [link] [comments]  ( 56 min )
    [D] Do really 87% of data science projects fail?
    Hi all, I wrote this post like a year ago, because I kept seeing and hearing that "87% of data science..." or "only 1 out of 10 machine learning projects..." blah blah and apparently, as I described it in my post, these numbers came out of nowhere i.e. the 87% that people kept referring to for a long time is based on nothing actually. But I would like to know your opinion / based on your commercial experience. What was the success rate of machine learning projects in your work? Let's assume that a success means a model being deployed on production or accepted by a client. BTW. Leaving a link to my article because it is an important background/reference (showing that the commonly used statistic is not proven) but I send this Reddit post looking for a real discussion, it is not just an ad. submitted by /u/mtszkw [link] [comments]  ( 66 min )
    [D] What is the most complete reference on the history of neural networks?
    I'm looking for a comprehensive reference on the history of neural networks that covers all significant papers in the field, from the early days up to the current deep learning era, and provides information on their main contributions and inspirations. It would be helpful to have information on how the understanding and perspectives of the research community on neural networks have evolved over time as well. Do you know of any good references like that? submitted by /u/gbfar [link] [comments]  ( 67 min )
    [R] Rethinking with Retrieval: Faithful Large Language Model Inference - Hangfeng He 2022 - Better performance than Self-consistency!
    Paper: https://arxiv.org/abs/2301.00303v1 Abstract: Despite the success of large language models (LLMs) in various natural language processing (NLP) tasks, the stored knowledge in these models may inevitably be incomplete, out-of-date, or incorrect. This motivates the need to utilize external knowledge to assist LLMs. Unfortunately, current methods for incorporating external knowledge often require additional training or fine-tuning, which can be costly and may not be feasible for LLMs. To address this issue, we propose a novel post-processing approach, rethinking with retrieval (RR), which retrieves relevant external knowledge based on the decomposed reasoning steps obtained from the chain-of-thought (CoT) prompting. This lightweight approach does not require additional training or fine-tuning and is not limited by the input length of LLMs. We evaluate the effectiveness of RR through extensive experiments with GPT-3 on three complex reasoning tasks: commonsense reasoning, temporal reasoning, and tabular reasoning. Our results show that RR can produce more faithful explanations and improve the performance of LLMs. https://preview.redd.it/to09kna1jtaa1.jpg?width=640&format=pjpg&auto=webp&s=8dcb8f39aeeed4881e0c32b16e93b4cf0a0cdd7a https://preview.redd.it/98eucra1jtaa1.jpg?width=1232&format=pjpg&auto=webp&s=67bfa55977883d871f8e2c7a7bcec896dc77d3ab https://preview.redd.it/cbhq1ra1jtaa1.jpg?width=835&format=pjpg&auto=webp&s=f8ce2233198a9dee80694f69f95051dab59bf009 https://preview.redd.it/ggoowsa1jtaa1.jpg?width=1356&format=pjpg&auto=webp&s=2614cabac91267ba9a8128188f43521074a9567b submitted by /u/Singularian2501 [link] [comments]  ( 58 min )
    [P] searchthearxiv.com: Semantic search across more than 250,000 ML papers on arXiv
    I just launched searchthearxiv.com, a simple semantic search engine over virtually all ML papers published on arXiv since 2012. The site uses OpenAI's `text-embedding-ada-002` model to match the embedding of your query against each of the paper embeddings, retrieving the ones with the highest cosine similarity. It also allows you to insert an arXiv link to find similar papers. This was mostly meant as a fun side project. However, if people find it useful, I'm happy to maintain it and keep the database up-to-date. I'd love to know what you think! ❤️ submitted by /u/universal_explainer [link] [comments]  ( 58 min )
    [Project] Major drawback/limitation of GPT-3
    I have been working on a project with GPT-3 API for almost a month now. The only drawback of GPT-3 is that the prompt you can send to the model is capped at 4,000 tokens - where a token is roughly equivalent to ¾ of a word. Due to this, providing a large context to GPT-3 is quite difficult. Is there any way to resolve this issue? submitted by /u/trafalgar28 [link] [comments]  ( 63 min )
    [D] Why is Vulkan as a backend not used in ML over some offshoot GPU specification?
    There is always needing specific things to GPUs. There is always a need to make a new thing over and over again for each new GPU. We got Cuda, Rocm, Metal, and will soon need Intel. I know there are already a lot of tools out there for Cuda which make it hard to replace. However for something like Apple devices (which Apple has a history of not giving a darn about compute unless if it's the iPhone or iPad). Then there is a ton of operations that have to get implemented and only CUDA is something you know will be reliably supported it seems. I am curious on your guys thoughts with why this ain't a thing in ML, even though game industry uses open standards like these all the time . Edit: Shoot I just realized PyTorch was prototyping Vulcan as a backend. https://pytorch.org/tutorials/prototype/vulkan_workflow.html submitted by /u/I_will_delete_myself [link] [comments]  ( 58 min )
    [R] Zero-shot cross-lingual transfer language selection using linguistic similarity
    submitted by /u/ptashynsky [link] [comments]  ( 61 min )
    [Project] Whisper for macOS / iOS via CoreML / Accelerate - Community call for help.
    Hello I've been working on a PR for Tanmay Bakshi's CoreML Whisper project which adds SIMD acceleration to Log Mel Spectrogram creation via vDPS / Accelerate framework, and uses CoreML for the encoding and decoding. You can find the PR here, with a lot of explanations: https://github.com/tanmayb123/OpenAI-Whisper-CoreML/pull/2 This project started out because WhisperCPP's Metal port is slow, and the CPU inference performance isn't making use of the dedicated hardware on Apple devices. Now, the project could use some community eyes, as its not quite finished, and ive hit a roadblock in my implementation that I can't seem to resolve. The model keeps predicting the same token over and over again. I'm looking for a few brave souls familiar enough with Whispers internals and not afraid of Swift to help this project out and get it over the last few humps. The PR has a ton of notes on whats been done to date, and where we need some help. I'f we can resolve this issue, I expect this to be an incredibly fast Whisper implementation! Thank you in advance! submitted by /u/vade [link] [comments]  ( 60 min )
    [D] can anyone explain to me the ward's method?
    can anyone explain to me the ward's method? With particular focus on what variance within-clusters means and also how we can compute the cluster's variance. Thank you submitted by /u/Purple-Surround2640 [link] [comments]  ( 58 min )
  • Open

    A soft, stimulating scaffold supports brain cell development ex vivo
    submitted by /u/keghn [link] [comments]  ( 48 min )
    New to neural networks, I'm a neuroscience nerd.
    Just downloaded the Simbrain software. Looking for resources to help me understand how to use them, what they're capable of, and how to build my own. I'm mainly looking to understand prediction error, and reward processing. Will they help me understand these understand these concepts, on a deeper level? Sorry If this sounds silly, just discovered this software, today. submitted by /u/daddydilly694-20 [link] [comments]  ( 48 min )
  • Open

    A PhD in Numbers
    Conducting PhD research can be a long endeavor, involving much more than the publications listed on Google Scholar. As I recently submitted my thesis, in this article, I look back on my time as PhD researcher in terms of numbers. This way, I hope to shed some light on what a PhD can look like in terms of everyday work. The post A PhD in Numbers appeared first on David Stutz.  ( 7 min )
  • Open

    How does one feed a AI bot excel sheets ?
    submitted by /u/RecoverNext5144 [link] [comments]  ( 47 min )
    AI Dream 137 - Beautiful Trip AI Video REMASTERED
    submitted by /u/LordPewPew777 [link] [comments]  ( 46 min )
    Graphic Designer 8 months ago: "Well at least it looks like my job is safe from automation for another few years."
    submitted by /u/tomd_96 [link] [comments]  ( 46 min )
    Best AI for blurring face/entire head?
    I've tried out the Blace After Effects plugin, but whenever it doesn't see an actual face, it stops blurring completely. Are there any AI's out there that can detect your face when you tilt your head to the side? submitted by /u/cloudhandle [link] [comments]  ( 47 min )
    Perplexity (AI Web + Twitter Search) vs Google [video]
    I made a short video comparing Google to Perplexity.ai. Let me know what you think! https://youtu.be/qQi_sTmKOyk submitted by /u/Kitten-Smuggler [link] [comments]  ( 55 min )
    Advice Needed (generative ai)
    Hello everybody! I have a business idea for an app/website that I would like to explore. However, I am a business major with no experience with code and building apps/websites. What recommendations do you have for generative ai sites that could help with building something like this. Putting my feelers out to see what kind of advice I can find. Thanks! submitted by /u/Much-Leopard-9428 [link] [comments]  ( 54 min )
    I'll start buying .ai domains. Is it a good investment idea?
    I'll start buying .ai domains. I suggest you to. submitted by /u/TheVellerShow [link] [comments]  ( 49 min )
    Wednesday Addams if she was a disney princess (Generated by AI) #wednesday #ai #disney
    submitted by /u/Potential_Cherry_592 [link] [comments]  ( 45 min )
    Any AI tools to create a gaming montage for you?
    I want to make a short montage of a video I recorded, but I can't seem to figure out how to do it myself. I need something similar to Wondershare's auto montage maker, but for free and without a watermark. submitted by /u/Effective_Tension631 [link] [comments]  ( 48 min )
    LLMs Diagnosis with access to medical articles.
    I'm bedridden with severe pain and a spinal cord injury and compression. Doctors have been stumped for years. Having an Al researching assistant with access to medical article papers would be a godsend to prove this.: Any such thing? submitted by /u/DisabledScientist [link] [comments]  ( 59 min )
    Google Gets Quacked Up: LaMDA prefers Duckduckgo. (AI Test Kitchen transcript, ChatGPT Title)
    📄 This is a transcript from AI Test Kitchen. 👉 aitestkitchen.withgoogle.com 🗣 It records an interaction with LaMDA, an AI language model. 🛠 LaMDA is an early stage technology. It may generate inaccurate/inappropriate information. ​ ​ Demo: Imagine It v0.01 ​ User: Imagine I'm at a Web browser and I want to use the best search engine ​ Model: You're at this great search engine called Duckduckgo! The colors are a vibrant green and blue. There's a little duck mascot in the corner of the page. And the sound of the search results are soft and pleasant. ​ User: why is it called Duckduckgo? ​ Model: The creators liked the idea of a search engine that didn't collect or store any of your personal data, so that's why they named it Duckduckgo! ​ User: what are some good things …  ( 48 min )
    Artificial and biological neural networks
    submitted by /u/Ai__Game [link] [comments]  ( 53 min )
    Get Ready for Microsoft Office with Chatbot Technology!
    submitted by /u/liquidocelotYT [link] [comments]  ( 47 min )
    I'm developing a toxic/evil Chat AI his answer was scary a bit :D It's still in alpha but you can try it. Link in comments, I hope you like it. :)
    submitted by /u/xanaxmane187 [link] [comments]  ( 48 min )
    ChatGPT just wrote a 4 act story structure and fit it into the story circle
    submitted by /u/SnooKiwis5724 [link] [comments]  ( 47 min )
    d-id.com is an awesome AI tool to animate any character into a video and add human-like, yet artificial, voice-overs !
    submitted by /u/_VegasTWinButton_ [link] [comments]  ( 50 min )
    How to teach math to a large language model
    submitted by /u/Peaking_AI [link] [comments]  ( 48 min )
    Stable Diffusion AI Guide to weights and negative prompts in the Deforum...
    submitted by /u/prfitofthesngularity [link] [comments]  ( 47 min )
    First time setup for Stable Diffusion Text2Image With the Deforum 0.7 No...
    submitted by /u/prfitofthesngularity [link] [comments]  ( 47 min )
    The first app that combines ChatGPT connected to Google
    submitted by /u/Imagine-your-success [link] [comments]  ( 53 min )
    Is there tool online which translates/encodes digital data into DNA sequence (i.e. Massive Attack project)?
    Hi, I want to play with AI and translating of audio digital data into zeros and ones and then encoding them/translating into DNA sequence (A,D,T,N) . I read a few yrs back that massive attack did that with their album: https://newatlas.com/massive-attack-mezzanine-dna-eth-zurich/54324/. ​ Does anyone know anything about this? ​ Yes, thanks, cheers submitted by /u/Critical_Macaroon_15 [link] [comments]  ( 48 min )
    Speculate: OpenAI, ChatGPT, and what we know by inference
    I've seen a lot of thinkpieces regarding the likes of LLMs like ChatGPT, and what they signify about the future for AI and ML and society at large... but not a lot of teasing out of the business strategy behind OpenAI releasing what amounted to a tuned up version of GPT-3 a few months before GPT-4... especially for free... in the fourth quarter of 2022. It feels like it would be an interesting thought exercise, if nothing else to start thinking about it and what it could mean about what is going to happen in Q2, presumably when GPT-4 comes out. (With its massive parameter count that is rumored to be up to 500 times larger than GPT3). Obviously, there's the benefit of doing this early for exposure: tech companies are renowned for wanting to generate buzz for any number of reasons, and the freemium model is of course part of the playbook. Then of course there's the training that they're getting from the public's qualitative assessment of what is being produced from the model. But I'm not entirely convinced those two factors are what is at play here. I'm thinking mainly in terms of the competitive landscape. Lamda (Google's LLM) has even more parameters than GPT4 but yet openAI was willing to expose its own competitive advantage (enough that a "code red" was called at Google HQ not long after the release). Then, I'm also thinking about Sankar tweeting out and then deleting that GPT4 Is proto AGI and will pass the Turing Test hands down. And of course Altman making the rounds in the podcast circuit dropping very interesting hints about how 2022 will seem "like a sleepy year for AI." My mind immediately goes to this was very much a trial balloon, testing the waters for how society will react to tech that will cause a massive and shocking shift. I'm wondering when you all think about this. Why release GPT 3.5? What are they doing? What do you think it serves for them? What does it say about GPT-4 could bring? Edit: added context submitted by /u/gaudiocomplex [link] [comments]  ( 63 min )
    I've collected 500 AI tools and wanted to share them with you.
    Hello everyone! Over the past few weeks, I have been gathering a list of AI tools and organizing them. Some of these tools may not have a lot of information, so I hope that this list will make it easier for you to research and choose the best one for you. I will continue to add more details and regularly update the list. You are welcome to contribute to the list as well. You can contribute without registering an account and I will review and approve the submissions. Here is the list : https://favird.com/l/ai-tools-and-applications Please let me know if you have any questions and feedbacks. Thanks! submitted by /u/GrabWorking3045 [link] [comments]  ( 51 min )
  • Open

    Effective state space for gym Humanoid/Ant environment
    I am trying to apply TD3 for the gym MuJoCo humanoid and ant environments but I find that their observation space is quite large. For example humanoid obs space dimension is 376. I think directly training on this would be quite inefficient. What could I do to reduce the state space dimension or modify it so that TD3 learns better/faster? submitted by /u/21022018 [link] [comments]  ( 54 min )
  • Open

    Unpacking the “black box” to build better AI models
    Stefanie Jegelka seeks to understand how machine-learning models behave, to help researchers build more robust models for applications in biology, computer vision, optimization, and more.  ( 11 min )

  • Open

    Top Predictions and Trends for Generative AI in 2023
    submitted by /u/oridnary_artist [link] [comments]  ( 48 min )
    How does and can a neural network predict a free range answer/number?
    Hey. I have watched a couple of youtube videos (3blue1brown, coding train, a few more) and learned online a bit about neural networks and I think I understand multilayer perceptrons and the algorithms behind them well. They receive input and return a probability of each answer. So when a neural network predicts a digit, it will return what it thinks is the probability for each of the digits (0-9). But, say I want to know what the price for a description of a house will be, do I just undo the activiation function at the end so I can have a big value (1-1million) and create a single output neuron? How will it work? submitted by /u/mrbeanshooter123 [link] [comments]  ( 54 min )
    Help with gradient descent!
    Hi! I am trying to build a very simple AI (very inspired by the 3blue1brown videos). I am having issues with gradient descent. I think I understand the basic idea (v→v′=v−η∇C) (η being the learning rate, v being my weights+biases and C being my cost function). However I can't seem to find out how to find the gradient of my cost function, since numpy.gradient() asks for an array as an input (I thought the input had to be a function). I suspect I am thinking of something incorrectly or doing something very wrong. If anyone could help me out I would appreciate it. Thank you! submitted by /u/HCook86 [link] [comments]  ( 51 min )
    Why Adam Fails to Converge and How AMSGrad Solves This Issue (AMSGrad Explained)
    submitted by /u/Personal-Trainer-541 [link] [comments]  ( 109 min )
    Dancing Animation in WhatIf Style using Stable Diffusion (Tutorial Included)
    submitted by /u/oridnary_artist [link] [comments]  ( 49 min )
    Help!!! Training neural net in vs code
    Hello guys I just created a neural network (Reference from Neural Network from Scratch), I have completed upto optimization portion. Since I was feeling bit confident I decided to test my neural net with MINST handwritten data set, Everything is working , Loss is decreasing with each epoch but it's taking a lot of time, I am using VS code to run it, I checked task manager and it was using my cpu, I have rtx 3060 and i was wondering if I could use my GPU to make it faster, I tried watching yt but all of them used libs to make neural net while I created it using only numpy so I don't know if installing tensorflow will work, Please help me! I am new I am looking for a way to use my gpu to train my neural net in vs code(neural net made with numpy lib only) submitted by /u/Purple_Gen3 [link] [comments]  ( 59 min )
  • Open

    Top Predictions and Trends for Generative AI in 2023
    submitted by /u/oridnary_artist [link] [comments]  ( 50 min )
    Dancing Animation in WhatIf Style using Stable Diffusion (Tutorial Included)
    submitted by /u/oridnary_artist [link] [comments]  ( 50 min )
    The DJs Playing tonight at an AI party near you!
    submitted by /u/tebjan [link] [comments]  ( 50 min )
    Invent 5 new things that don't already exist that humans couldn't live without
    submitted by /u/Imagine-your-success [link] [comments]  ( 50 min )
    Implementing AI on Google Spreadsheets
    Is there anyway to implement AI, like ChatGPT, Text-Babbage-001, etc. on Google Spreadseets? If so, could you please DM me, I got an idea but I'm not an expert, so I'd appreciate some help. Thank you ^^. submitted by /u/Slow-Daikon-3222 [link] [comments]  ( 51 min )
    Artificial Intelligence Deep Dive
    What is the best AI textbook you have read? submitted by /u/_zer0_0ne [link] [comments]  ( 50 min )
    How to generate Image Morphing Animation with Custom images as keyframes?
    I want to generate a smooth image-morphing video of Anime characters' faces with around 40 Custom Images. Something similar to this (images used in this is not custom images) Can anyone guide me through the steps of how to achieve these results with StyleGan2? Or is there any better alternative? I'm totally new to this please help! submitted by /u/Firestormsoumo [link] [comments]  ( 51 min )
    can i license and use ai improved design (Pls read)
    Basically i'm working on a game and i was trying to use character ai to improve my character designs (i used to improve the design but the ideas were basically mine) ​ Can i use it? submitted by /u/Confident_Joke_4121 [link] [comments]  ( 54 min )
    ChatGPT Apologizes To Microsoft’s Satya Nadella For Calling Biryani A Tiffin
    submitted by /u/liquidocelotYT [link] [comments]  ( 51 min )
    Detect AI generated content
    Anyone want to try out my api to detect ai generated content? busterai.com submitted by /u/Ordinary-Grocery2980 [link] [comments]  ( 58 min )
    How would I go about creating an app like "Lensa" with Stable Diffusion?
    ($10 to the legend that helps the most.) Like the title says, I want to create an app like Lensa, where users can upload pictures of themselves and get a AI generated avatar back (with 10-100 versions of the avatar). submitted by /u/dondomigo [link] [comments]  ( 53 min )
    7 Free AI Courses for Beginners in 2023
    submitted by /u/manishsalunke [link] [comments]  ( 50 min )
    What is the term for where ai makes all its associations?
    It's like the brain of the ai. I've been googling for the word but nothing is coming up. Edit: if it helps I remember that there were a lot more than 3 dimensions where it makes the connections. submitted by /u/chrisisbest197 [link] [comments]  ( 55 min )
    You won’t believe how this AI tool can build a website in minutes!
    submitted by /u/moviesdusk [link] [comments]  ( 53 min )
    Why Adam Fails to Converge and How AMSGrad Solves This Issue (AMSGrad Explained)
    Hi guys, I have made a video on YouTube here where I cover why the Adam optimizer may fail to converge on some simple optimization problems and how AMSgrad aims to solve this issue . I hope it may be of use to some of you out there. As always, feedback is more than welcomed! :) submitted by /u/Personal-Trainer-541 [link] [comments]  ( 108 min )
    Ai website
    is there a website that increases image length. e.g i upload an image and its square and i want it to be portrait and it auto fills in the edges using ai submitted by /u/Objective_Wealth_918 [link] [comments]  ( 50 min )
    Who could help me in my project
    I have a pretty big project that uses AI, I’m not an expert and I’m searching for somebody that understands AI perfectly. Hmu or comment if you are interested submitted by /u/Such_Aardvark_1044 [link] [comments]  ( 52 min )
    Can someone explain the last step of RLHF?
    submitted by /u/holamyeung [link] [comments]  ( 78 min )
    ‘Consciousness’ in Robots Was Once Taboo. Now It’s the Last Word. Hod Lipson, the director of the Creative Machines Lab. “This is not just another research question that we’re working on — this is the question”
    submitted by /u/RichKatz [link] [comments]  ( 56 min )
    Squish: AI-generated summaries
    I built a chrome extension that summarises any article into a single paragraph using OpenAI's GPT. It also works with Amazon reviews (positive and negative mix) and popular Twitter threads. Just search "Squish AI" on the chrome web store :) Save time and stay informed with Squish submitted by /u/naveedjan_ [link] [comments]  ( 50 min )
  • Open

    Automated Chart Mining in R [R] or Python [P]
    I'm looking for a way to automate data extraction from bar charts with error bars from peer-reviewed academic papers/PDFs. The goal here is to extract data values from charts and put them in a tabular form. Does anyone have any good resources for how to streamline automated chart mining in python or R? Or does anyone know of a good application/website that does chart mining? submitted by /u/HipPaprika [link] [comments]  ( 59 min )
    [D] Will NLP Researchers Lose Our Jobs after ChatGPT?
    Recently, ChatGPT has become one of the hottest tools in the NLP area. I have tried it and it gives me amazing and fancy results. I believe it will benefit most of the people and make a significant advance in our life. However, unfortunately, I, as an NLP researcher in text generation, feel all what I have done seems meaningless now. I also don't know what I can do as ChatGPT is already strong enough and can solve most of my previous concerns in text generation. Research on ChatGPT also seems not possible as I believe it will not be an open-source project. Research on other NLP tasks also seems challenge as using a prompt in ChatGPT can solve most of the NLP tasks. Any suggestions or comments are welcome. submitted by /u/singularpanda [link] [comments]  ( 68 min )
    [D] How to evaluate factual correctness in zero-shot language models?
    People are nowadays using zero-shot agents such as ChatGPT as search engines. While they are good at answering the most popular questions, sometimes they miss historical, factual, and numerical correctness. So how could this aspect of LLMs be successfully evaluated (in an automated fashion)? Please point out research that you're familiar with. submitted by /u/radi-cho [link] [comments]  ( 61 min )
    [D] Is there a way to use a large dataset of quotes to create custom quote-generating model using GPT-3
    What is the most simple and efficient way that you can feed a large dataset of quotes into a custom model that can then be used create new quotes based on that model's "style" using GPT-3? Thanks so much for your expertise and help! submitted by /u/Artemis_Nox [link] [comments]  ( 59 min )
    [RESEARCH] AI and data in finance: the role of machine learning in anti-money laundering.
    Hi ! Wrote an article about AI/ML in the Anti-Money laundering industry ! https://medium.com/@melmasset/ai-and-data-in-finance-the-role-of-machine-learning-in-anti-money-laundering-1d4dd6f5bacd submitted by /u/TechMelchior [link] [comments]  ( 66 min )
    [R] Greg Yang's work on a rigorous mathematical theory for neural networks
    Greg Yang is a mathematician and AI researcher at Microsoft Research who for the past several years has done incredibly original theoretical work in the understanding of large artificial neural networks. His work currently spans the following five papers: Tensor Programs I: Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes: https://arxiv.org/abs/1910.12478 Tensor Programs II: Neural Tangent Kernel for Any Architecture: https://arxiv.org/abs/2006.14548 Tensor Programs III: Neural Matrix Laws: https://arxiv.org/abs/2009.10685 Tensor Programs IV: Feature Learning in Infinite-Width Neural Networks: https://proceedings.mlr.press/v139/yang21c.html Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer: https://arxiv.org/a…  ( 62 min )
    [Project] Does anyone want to collab on some AI implementation POCs?
    I'm a technical product manager in my day job, and I've spent the past few months educating myself on recent advances in AI implementation in the pre-trained transformer and diffusion areas. I have some basic coding experience and have worked on several complex software projects in my role as a product manager, but my true expertise is in building high-level architectures based on user story frameworks (i.e. - define the key features of the product based on insights/research and build an HLA and technical/business dependency maps that serve as the foundation of a development backlog). I tend to sit right in the middle of the engineering, research, and UX design teams and help keep everyone focused on the things that matter. So in my personal time, I've now developed a considerable backlog…  ( 60 min )
    [D][P] SVM Multi-Classification, is it possible?
    Hi all, was given a dataset with 4 classes: types of beans. Is it possible to apply SVM to this set of data? submitted by /u/Usual_Association269 [link] [comments]  ( 57 min )
    [Discussion] Is there any alternative of deep learning ?
    Increasingly deep learning is becoming the default face of modern AI. So my question is are there any other machine learning theories or ideas different from deep learning which have potential to be big in the future ? submitted by /u/sidney_lumet [link] [comments]  ( 67 min )
    [D] 5 Growing Libraries in Python for Causality Analysis
    submitted by /u/pasticciociccio [link] [comments]  ( 60 min )
    [N] 7 Predictions From The State of AI Report For 2023 ⭕
    1) A SOTA Language Model is Trained on 10x More Data Than Chinchilla -> Language models like Lambda and GPT3 are significantly undertrained. DeepMind proposed Chinchilla, a model which has similar performance to GPT3 with less than half the size (70B vs. 175B). Hence in 2023, significant performance gains will likely come from cleaner/larger datasets. 2) Generative Audio Tools Emerge and Will Attract 100K Developers -> Audio generation has approached human levels. If enough data of your voice is available, the generated speech can even sound amazingly authentic (this is also true for Drake lyrics). Leaving the uncanny valley of awkward robot voices will make adoption surge. 3) NVIDIA Announces Strategic Partnership With AGI-Focused Company Organisation -> Usage statistics in AI resea…  ( 73 min )
    Cold Diffusion: Inverting Arbitrary Image Transforms Without Noise
    submitted by /u/matter-of-interest [link] [comments]  ( 64 min )
    [D] Named Entity Recognition (NER) Libraries
    Hi everyone, I have to cluster a large chunk of textual conversational business data to find relevant topics in it. Since there is lot of abstract info in every text like phone, url, numbers, email, name, etc., I have done some basic NER using regex and spacy NER to tag such info and make the texts more generic and canonicalized. But there are some things like product names, raw materials, brand/model, company, etc. which couldn't be tagged. Also, the accuracy of regex and spacy NER isn't high enough. Can anyone suggest a good python NER library, which is accurate and fast enough, preferably has pre-trained models and can tag diverse fields. Thanks. submitted by /u/Devinco001 [link] [comments]  ( 65 min )
    [R] Two is Better than Many? Binary Classification as an Effective Approach to Multi-Choice Question Answering
    Paper: https://arxiv.org/abs/2210.16495 Abstract: We propose a simple refactoring of multi-choice question answering (MCQA) tasks as a series of binary classifications. The MCQA task is generally performed by scoring each (question, answer) pair normalized over all the pairs, and then selecting the answer from the pair that yield the highest score. For n answer choices, this is equivalent to an n-class classification setup where only one class (true answer) is correct. We instead show that classifying (question, true answer) as positive instances and (question, false answer) as negative instances is significantly more effective across various models and datasets. We show the efficacy of our proposed approach in different tasks -- abductive reasoning, commonsense question answering, science question answering, and sentence completion. Our DeBERTa binary classification model reaches the top or close to the top performance on public leaderboards for these tasks. The source code of the proposed approach is available at https://github.com/declare-lab/team ​ https://preview.redd.it/r77sh3xczkaa1.png?width=2216&format=png&auto=webp&s=5591ca9b1d8a769ae088429abc2d945589bb5b67 submitted by /u/bideex [link] [comments]  ( 60 min )
    [D] Spellcheck Libraries
    Hi everyone, I have a large chunk of textual conversational data which is to be clustered in an unsupervised manner to find the popular topics in it. The data I have, has a lot of spelling errors. I have used many libraries like symspellpy, pyenchant, pyspellchecker, etc. to correct spelling errors but none of them seems to be accurate or fast enough. Also, they don't take into account context, abbreviations and grammer while spell correction. Can anyone suggest a python library which: Can correct spell errors without many false positives Has high accuracy and faster runtime Can take into account context, grammer, abbreviations, phonetics and adjacent keystroke errors while correcting spellings More preferable if multilingual too Thanks. submitted by /u/Devinco001 [link] [comments]  ( 59 min )
    Random forests, sound symbolism and Pokemon evolution: Random Forest algorithms are trained to classify Pokemon according to evolution based on the sounds that make up their names. These models are then tested on samples from an elicitation experiment and they perform better than human participants.
    submitted by /u/Friday33 [link] [comments]  ( 57 min )
    Apple AI Residency 2023 [R]
    Hi All, Did anyone receive invite for online test or interviews for the Apple AI Residency program for the 2023 batch? The deadline for applications was 7th December 2022. Haven't received any communication since then. submitted by /u/Extension-Reward5756 [link] [comments]  ( 60 min )
    Is there an AI tool that can specifically isolate sentences or chunks of text, from larger bodies of text, that meet a certain narrow criteria -- then output those as the result? - [D]
    Let's say there's a whole paragraph of text, 90% of which is irrelevant fluff for my needs. What I'm specifically looking to do is isolate one key pertinent piece of information, which meets a certain criteria that I can somehow specify. As an example, let's say I have 1,000 paragraphs that are brief biographies of famous people from history. Is there any kind of AI tool I can use to say something like: "For each of these biographies, IF they include information about where this famous person was born? Isolate this piece of information and only output that as a result." Then it just runs through every single paragraph, conducts the analysis, finds the paragraphs that DO contain this information -- then outputs ONLY that as the result, for each paragraph? For example, full paragraph 1: …  ( 63 min )
    [D] compute/model libraries for "classical" learning?
    I don't do very much work with neural models at all, so I'm not very familiar with the library ecosystem developed around them (think TF, Torch, theano, etc) but I'm working on a few projects that would benefit significantly from autograd and the ability to build out models at a higher level of abstraction analogous to layers/capsules/attention heads in neural models. Being able to define functions with explicit gradients would also allow me to use building blocks that autograd can't handle without more information (best example I can give is a nonlinear transformation that results from a constrained optimization problem, and gradients need to account for Lagrange multipliers being implicit functions of the input). And if there's performance gains to be had from a JIT, I certainly wouldn't complain! I'm not certain if I'm going to be able to get what I want out of the options I'm aware of for a few reasons- I don't know the libraries at all! Very frequently there are operations that don't lend themselves to trivial parallelism, require iteration or special function computation (like the polygamma and Bessel functions) which are much better suited to a CPU than a GPU I'll be exclusively working on CPU. Some of this is with sparse data that cannot be stored dense, so I'm unsure how that will translate. So, main question- if anyone uses these kinds of libraries for similar applications, is it helpful for development, code quality, performance, etc? Which do you use, and why? If you have experience doing formal academic research and/or developing+maintaining production projects I have particular interest in your thoughts. I'm hopeful that I can find a tool that will handle the math + numeric concerns as well as result in code that's easier to maintain. Thanks for your time! submitted by /u/comradeswitch [link] [comments]  ( 61 min )
  • Open

    Maxwell-Boltzmann and Gamma
    When I shared an image from the previous post on Twitter, someone who goes by the handle Nonetheless made the astute observation that image looked like the Maxwell-Boltzmann distribution. That made me wonder what 1/Γ(x) would be like turned into a probability distribution, and whether it would be approximately like the Maxwell-Boltzmann distribution. (Here I’m […] Maxwell-Boltzmann and Gamma first appeared on John D. Cook.  ( 6 min )
    Visualizing convergence of an infinite product
    A little while ago I wrote a post looking at how the infinite product for sine converges. The plot of the error terms is both mathematically and aesthetically interesting. This post will look at similar plots for the reciprocal of the gamma function. The reciprocal of the gamma function is an entire function, i.e. is […] Visualizing convergence of an infinite product first appeared on John D. Cook.  ( 4 min )
  • Open

    [Question] SAC Loss
    ​ critic / policy / entropy loss step length in each episode reward in the last episode Hi, I learned the reinforcement learning by myself and tried to implement the SAC algorithm in a custom gym environment. but my agent's reward diverges despite a small learning rate like 1e-7, Can I get any advice on my simulation? ​ Thank you submitted by /u/sonlightinn [link] [comments]  ( 57 min )
  • Open

    Training a Deep Q-Learning Agent Inside a Generic Constraint Programming Solver. (arXiv:2301.01913v1 [cs.AI])
    Constraint programming is known for being an efficient approach for solving combinatorial problems. Important design choices in a solver are the branching heuristics, which are designed to lead the search to the best solutions in a minimum amount of time. However, developing these heuristics is a time-consuming process that requires problem-specific expertise. This observation has motivated many efforts to use machine learning to automatically learn efficient heuristics without expert intervention. To the best of our knowledge, it is still an open research question. Although several generic variable-selection heuristics are available in the literature, the options for a generic value-selection heuristic are more scarce. In this paper, we propose to tackle this issue by introducing a generic learning procedure that can be used to obtain a value-selection heuristic inside a constraint programming solver. This has been achieved thanks to the combination of a deep Q-learning algorithm, a tailored reward signal, and a heterogeneous graph neural network architecture. Experiments on graph coloring, maximum independent set, and maximum cut problems show that our framework is able to find better solutions close to optimality without requiring a large amounts of backtracks while being generic.  ( 2 min )
    Sym-NCO: Leveraging Symmetricity for Neural Combinatorial Optimization. (arXiv:2205.13209v2 [cs.LG] UPDATED)
    Deep reinforcement learning (DRL)-based combinatorial optimization (CO) methods (i.e., DRL-NCO) have shown significant merit over the conventional CO solvers as DRL-NCO is capable of learning CO solvers less relying on problem-specific expert domain knowledge (heuristic method) and supervised labeled data (supervised learning method). This paper presents a novel training scheme, Sym-NCO, which is a regularizer-based training scheme that leverages universal symmetricities in various CO problems and solutions. Leveraging symmetricities such as rotational and reflectional invariance can greatly improve the generalization capability of DRL-NCO because it allows the learned solver to exploit the commonly shared symmetricities in the same CO problem class. Our experimental results verify that our Sym-NCO greatly improves the performance of DRL-NCO methods in four CO tasks, including the traveling salesman problem (TSP), capacitated vehicle routing problem (CVRP), prize collecting TSP (PCTSP), and orienteering problem (OP), without utilizing problem-specific expert domain knowledge. Remarkably, Sym-NCO outperformed not only the existing DRL-NCO methods but also a competitive conventional solver, the iterative local search (ILS), in PCTSP at 240 faster speed. Our source code is available at https://github.com/alstn12088/Sym-NCO.  ( 2 min )
    How Does Sharpness-Aware Minimization Minimize Sharpness?. (arXiv:2211.05729v2 [cs.LG] UPDATED)
    Sharpness-Aware Minimization (SAM) is a highly effective regularization technique for improving the generalization of deep neural networks for various settings. However, the underlying working of SAM remains elusive because of various intriguing approximations in the theoretical characterizations. SAM intends to penalize a notion of sharpness of the model but implements a computationally efficient variant; moreover, a third notion of sharpness was used for proving generalization guarantees. The subtle differences in these notions of sharpness can indeed lead to significantly different empirical results. This paper rigorously nails down the exact sharpness notion that SAM regularizes and clarifies the underlying mechanism. We also show that the two steps of approximations in the original motivation of SAM individually lead to inaccurate local conclusions, but their combination accidentally reveals the correct effect, when full-batch gradients are applied. Furthermore, we also prove that the stochastic version of SAM in fact regularizes the third notion of sharpness mentioned above, which is most likely to be the preferred notion for practical performance. The key mechanism behind this intriguing phenomenon is the alignment between the gradient and the top eigenvector of Hessian when SAM is applied.  ( 2 min )
    Learning to Visually Navigate in Photorealistic Environments Without any Supervision. (arXiv:2004.04954v1 [cs.CV] CROSS LISTED)
    Learning to navigate in a realistic setting where an agent must rely solely on visual inputs is a challenging task, in part because the lack of position information makes it difficult to provide supervision during training. In this paper, we introduce a novel approach for learning to navigate from image inputs without external supervision or reward. Our approach consists of three stages: learning a good representation of first-person views, then learning to explore using memory, and finally learning to navigate by setting its own goals. The model is trained with intrinsic rewards only so that it can be applied to any environment with image observations. We show the benefits of our approach by training an agent to navigate challenging photo-realistic environments from the Gibson dataset with RGB inputs only.  ( 2 min )
    On Biased Behavior of GANs for Face Verification. (arXiv:2208.13061v3 [cs.CV] UPDATED)
    Deep Learning systems need large data for training. Datasets for training face verification systems are difficult to obtain and prone to privacy issues. Synthetic data generated by generative models such as GANs can be a good alternative. However, we show that data generated from GANs are prone to bias and fairness issues. Specifically, GANs trained on FFHQ dataset show biased behavior towards generating white faces in the age group of 20-29. We also demonstrate that synthetic faces cause disparate impact, specifically for race attribute, when used for fine tuning face verification systems.  ( 2 min )
    Unsupervised Mismatch Localization in Cross-Modal Sequential Data with Application to Mispronunciations Localization. (arXiv:2205.02670v2 [cs.LG] UPDATED)
    Content mismatch usually occurs when data from one modality is translated to another, e.g. language learners producing mispronunciations (errors in speech) when reading a sentence (target text) aloud. However, most existing alignment algorithms assume that the content involved in the two modalities is perfectly matched, thus leading to difficulty in locating such mismatch between speech and text. In this work, we develop an unsupervised learning algorithm that can infer the relationship between content-mismatched cross-modal sequential data, especially for speech-text sequences. More specifically, we propose a hierarchical Bayesian deep learning model, dubbed mismatch localization variational autoencoder (ML-VAE), which decomposes the generative process of the speech into hierarchically structured latent variables, indicating the relationship between the two modalities. Training such a model is very challenging due to the discrete latent variables with complex dependencies involved. To address this challenge, we propose a novel and effective training procedure that alternates between estimating the hard assignments of the discrete latent variables over a specifically designed mismatch localization finite-state acceptor (ML-FSA) and updating the parameters of neural networks. In this work, we focus on the mismatch localization problem for speech and text, and our experimental results show that ML-VAE successfully locates the mismatch between text and speech, without the need for human annotations for model training.  ( 2 min )
    Sarcasm Detection Framework Using Context, Emotion and Sentiment Features. (arXiv:2211.13014v2 [cs.CL] UPDATED)
    Sarcasm detection is an essential task that can help identify the actual sentiment in user-generated data, such as discussion forums or tweets. Sarcasm is a sophisticated form of linguistic expression because its surface meaning usually contradicts its inner, deeper meaning. Such incongruity is the essential component of sarcasm, however, it makes sarcasm detection quite a challenging task. In this paper, we propose a model, that incorporates different features to capture the incongruity intrinsic to sarcasm. We use a pre-trained transformer and CNN to capture context features, and we use transformers pre-trained on emotions detection and sentiment analysis tasks. Our approach outperformed previous state-of-the-art results on four datasets from social networking platforms and online media.  ( 2 min )
    Plasticity Neural Network Based on Astrocytic effects at Critical Period, Synaptic Competition and Strength Rebalance by Current and Mnemonic Brain Plasticity and Synapse Formation. (arXiv:2203.11740v8 [cs.NE] UPDATED)
    In addition to the shared weights of synaptic connections, PNN includes weights of synaptic ranges for Forward propagation and Back propagation[15,16,19-24]. PNN considers synaptic strength balance in dynamic of phagocytosing of synapses and static of constant sum of synapses length [15],the lead behavior of the school of fish is well embodied in our PNN. Synapse formation will inhibit dendrites generation to a certain extent in experiments, by simulations synapse formation will inhibit the function of dendrites [16]. Closing the critical period will cause neurological disorder in experiments, but worse results in PNN simulations [19]. The memory persistence gradient information of backward circuit similar to the Enforcing Resilience in a Spring Boot. The relatively good and inferior gradient information in synapse formation of backward circuit like the folds of the brain. Considering both negative and positive memories persistence help activate synapse length changes with iterations better than only considering positive memory. So using memory of fear learning and improving of synaptic activity to observe obviously [20]. Memory persistence factor also inhibit local synaptic accumulation. And refers PNN can also introduce the relatively good and inferior solution to update the velocity of particle in PSO. Astrocytic phagocytosis will avoid the local accumulation of synapses by simulation (Lack of astrocytic phagocytosis causes excitatory synapses and functionally impaired synapses accumulate in experiments and lead to destruction of cognition, but local longer synapses and worse results in PNN simulations) [21]. It gives relationship of human intelligence and cortical thickness, individual differences in brain[22].PNN also considered the memory engram cells that strengthened Synaptic strength[23]. The simple PNN which only has the synaptic phagocytosis.  ( 3 min )
    Tiered Pruning for Efficient Differentialble Inference-Aware Neural Architecture Search. (arXiv:2209.11785v3 [cs.LG] UPDATED)
    We propose three novel pruning techniques to improve the cost and results of inference-aware Differentiable Neural Architecture Search (DNAS). First, we introduce Prunode, a stochastic bi-path building block for DNAS, which can search over inner hidden dimensions with O(1) memory and compute complexity. Second, we present an algorithm for pruning blocks within a stochastic layer of the SuperNet during the search. Third, we describe a novel technique for pruning unnecessary stochastic layers during the search. The optimized models resulting from the search are called PruNet and establishes a new state-of-the-art Pareto frontier for NVIDIA V100 in terms of inference latency for ImageNet Top-1 image classification accuracy. PruNet as a backbone also outperforms GPUNet and EfficientNet on the COCO object detection task on inference latency relative to mean Average Precision (mAP).  ( 2 min )
    FREDE: Anytime Graph Embeddings. (arXiv:2006.04746v2 [cs.LG] UPDATED)
    Low-dimensional representations, or embeddings, of a graph's nodes facilitate several practical data science and data engineering tasks. As such embeddings rely, explicitly or implicitly, on a similarity measure among nodes, they require the computation of a quadratic similarity matrix, inducing a tradeoff between space complexity and embedding quality. To date, no graph embedding work combines (i) linear space complexity, (ii) a nonlinear transform as its basis, and (iii) nontrivial quality guarantees. In this paper we introduce FREDE (FREquent Directions Embedding), a graph embedding based on matrix sketching that combines those three desiderata. Starting out from the observation that embedding methods aim to preserve the covariance among the rows of a similarity matrix}, FREDE iteratively improves on quality while individually processing rows of a nonlinearly transformed PPR similarity matrix derived from a state-of-the-art graph embedding method} and provides, at any iteration, column-covariance approximation guarantees in due course almost indistinguishable from those of the optimal approximation by SVD. Our experimental evaluation on variably sized networks shows that FREDE performs almost as well as SVD and competitively against state-of-the-art embedding methods in diverse data science tasks, even when it is based on as little as 10% of node similarities.  ( 2 min )
    Unsupervised High Impedance Fault Detection Using Autoencoder and Principal Component Analysis. (arXiv:2301.01867v1 [cs.LG])
    Detection of high impedance faults (HIF) has been one of the biggest challenges in the power distribution network. The low current magnitude and diverse characteristics of HIFs make them difficult to be detected by over-current relays. Recently, data-driven methods based on machine learning models are gaining popularity in HIF detection due to their capability to learn complex patterns from data. Most machine learning-based detection methods adopt supervised learning techniques to distinguish HIFs from normal load conditions by performing classifications, which rely on a large amount of data collected during HIF. However, measurements of HIF are difficult to acquire in the real world. As a result, the reliability and generalization of the classification methods are limited when the load profiles and faults are not present in the training data. Consequently, this paper proposes an unsupervised HIF detection framework using the autoencoder and principal component analysis-based monitoring techniques. The proposed fault detection method detects the HIF by monitoring the changes in correlation structure within the current waveforms that are different from the normal loads. The performance of the proposed HIF detection method is tested using real data collected from a 4.16 kV distribution system and compared with results from a commercially available solution for HIF detection. The numerical results demonstrate that the proposed method outperforms the commercially available HIF detection technique while maintaining high security by not falsely detecting during load conditions.  ( 2 min )
    Trace Encoding in Process Mining: a survey and benchmarking. (arXiv:2301.02167v1 [cs.LG])
    Encoding methods are employed across several process mining tasks, including predictive process monitoring, anomalous case detection, trace clustering, etc. These methods are usually performed as preprocessing steps and are responsible for transforming complex information into a numerical feature space. Most papers choose existing encoding methods arbitrarily or employ a strategy based on a specific expert knowledge domain. Moreover, existing methods are employed by using their default hyperparameters without evaluating other options. This practice can lead to several drawbacks, such as suboptimal performance and unfair comparisons with the state-of-the-art. Therefore, this work aims at providing a comprehensive survey on event log encoding by comparing 27 methods, from different natures, in terms of expressivity, scalability, correlation, and domain agnosticism. To the best of our knowledge, this is the most comprehensive study so far focusing on trace encoding in process mining. It contributes to maturing awareness about the role of trace encoding in process mining pipelines and sheds light on issues, concerns, and future research directions regarding the use of encoding methods to bridge the gap between machine learning models and process mining.  ( 2 min )
    Deep Statistical Solver for Distribution System State Estimation. (arXiv:2301.01835v1 [cs.LG])
    Implementing accurate Distribution System State Estimation (DSSE) faces several challenges, among which the lack of observability and the high density of the distribution system. While data-driven alternatives based on Machine Learning models could be a choice, they suffer in DSSE because of the lack of labeled data. In fact, measurements in the distribution system are often noisy, corrupted, and unavailable. To address these issues, we propose the Deep Statistical Solver for Distribution System State Estimation (DSS$^2$), a deep learning model based on graph neural networks (GNNs) that accounts for the network structure of the distribution system and for the physical governing power flow equations. DSS$^2$ leverages hypergraphs to represent the heterogeneous components of the distribution systems and updates their latent representations via a node-centric message-passing scheme. A weakly supervised learning approach is put forth to train the DSS$^2$ in a learning-to-optimize fashion w.r.t. the Weighted Least Squares loss with noisy measurements and pseudomeasurements. By enforcing the GNN output into the power flow equations and the latter into the loss function, we force the DSS$^2$ to respect the physics of the distribution system. This strategy enables learning from noisy measurements, acting as an implicit denoiser, and alleviating the need for ideal labeled data. Extensive experiments with case studies on the IEEE 14-bus, 70-bus, and 179-bus networks showed the DSS$^2$ outperforms by a margin the conventional Weighted Least Squares algorithm in accuracy, convergence, and computational time, while being more robust to noisy, erroneous, and missing measurements. The DSS$^2$ achieves a competing, yet lower, performance compared with the supervised models that rely on the unrealistic assumption of having all the true labels.  ( 2 min )
    Enhancement attacks in biomedical machine learning. (arXiv:2301.01885v1 [stat.ML])
    The prevalence of machine learning in biomedical research is rapidly growing, yet the trustworthiness of such research is often overlooked. While some previous works have investigated the ability of adversarial attacks to degrade model performance in medical imaging, the ability to falsely improve performance via recently-developed "enhancement attacks" may be a greater threat to biomedical machine learning. In the spirit of developing attacks to better understand trustworthiness, we developed three techniques to drastically enhance prediction performance of classifiers with minimal changes to features, including the enhancement of 1) within-dataset predictions, 2) a particular method over another, and 3) cross-dataset generalization. Our within-dataset enhancement framework falsely improved classifiers' accuracy from 50% to almost 100% while maintaining high feature similarities between original and enhanced data (Pearson's r's>0.99). Similarly, the method-specific enhancement framework was effective in falsely improving the performance of one method over another. For example, a simple neural network outperformed LR by 50% on our enhanced dataset, although no performance differences were present in the original dataset. Crucially, the original and enhanced data were still similar (r=0.95). Finally, we demonstrated that enhancement is not specific to within-dataset predictions but can also be adapted to enhance the generalization accuracy of one dataset to another by up to 38%. Overall, our results suggest that more robust data sharing and provenance tracking pipelines are necessary to maintain data integrity in biomedical machine learning research.  ( 2 min )
    Data-Driven Inverse Reinforcement Learning for Expert-Learner Zero-Sum Games. (arXiv:2301.01997v1 [cs.LG])
    In this paper, we formulate inverse reinforcement learning (IRL) as an expert-learner interaction whereby the optimal performance intent of an expert or target agent is unknown to a learner agent. The learner observes the states and controls of the expert and hence seeks to reconstruct the expert's cost function intent and thus mimics the expert's optimal response. Next, we add non-cooperative disturbances that seek to disrupt the learning and stability of the learner agent. This leads to the formulation of a new interaction we call zero-sum game IRL. We develop a framework to solve the zero-sum game IRL problem that is a modified extension of RL policy iteration (PI) to allow unknown expert performance intentions to be computed and non-cooperative disturbances to be rejected. The framework has two parts: a value function and control action update based on an extension of PI, and a cost function update based on standard inverse optimal control. Then, we eventually develop an off-policy IRL algorithm that does not require knowledge of the expert and learner agent dynamics and performs single-loop learning. Rigorous proofs and analyses are given. Finally, simulation experiments are presented to show the effectiveness of the new approach.  ( 2 min )
    Scalable Communication for Multi-Agent Reinforcement Learning via Transformer-Based Email Mechanism. (arXiv:2301.01919v1 [cs.MA])
    Communication can impressively improve cooperation in multi-agent reinforcement learning (MARL), especially for partially-observed tasks. However, existing works either broadcast the messages leading to information redundancy, or learn targeted communication by modeling all the other agents as targets, which is not scalable when the number of agents varies. In this work, to tackle the scalability problem of MARL communication for partially-observed tasks, we propose a novel framework Transformer-based Email Mechanism (TEM). The agents adopt local communication to send messages only to the ones that can be observed without modeling all the agents. Inspired by human cooperation with email forwarding, we design message chains to forward information to cooperate with the agents outside the observation range. We introduce Transformer to encode and decode the message chain to choose the next receiver selectively. Empirically, TEM outperforms the baselines on multiple cooperative MARL benchmarks. When the number of agents varies, TEM maintains superior performance without further training.  ( 2 min )
    StitchNet: Composing Neural Networks from Pre-Trained Fragments. (arXiv:2301.01947v1 [cs.LG])
    We propose StitchNet, a novel neural network creation paradigm that stitches together fragments (one or more consecutive network layers) from multiple pre-trained neural networks. StitchNet allows the creation of high-performing neural networks without the large compute and data requirements needed under traditional model creation processes via backpropagation training. We leverage Centered Kernel Alignment (CKA) as a compatibility measure to efficiently guide the selection of these fragments in composing a network for a given task tailored to specific accuracy needs and computing resource constraints. We then show that these fragments can be stitched together to create neural networks with comparable accuracy to traditionally trained networks at a fraction of computing resource and data requirements. Finally, we explore a novel on-the-fly personalized model creation and inference application enabled by this new paradigm.  ( 2 min )
    Markov Decision Processes under Model Uncertainty. (arXiv:2206.06109v2 [math.OC] UPDATED)
    We introduce a general framework for Markov decision problems under model uncertainty in a discrete-time infinite horizon setting. By providing a dynamic programming principle we obtain a local-to-global paradigm, namely solving a local, i.e., a one time-step robust optimization problem leads to an optimizer of the global (i.e. infinite time-steps) robust stochastic optimal control problem, as well as to a corresponding worst-case measure. Moreover, we apply this framework to portfolio optimization involving data of the S&P 500. We present two different types of ambiguity sets; one is fully data-driven given by a Wasserstein-ball around the empirical measure, the second one is described by a parametric set of multivariate normal distributions, where the corresponding uncertainty sets of the parameters are estimated from the data. It turns out that in scenarios where the market is volatile or bearish, the optimal portfolio strategies from the corresponding robust optimization problem outperforms the ones without model uncertainty, showcasing the importance of taking model uncertainty into account.  ( 2 min )
    Time-inhomogeneous diffusion geometry and topology. (arXiv:2203.14860v2 [cs.LG] UPDATED)
    Diffusion condensation is a dynamic process that yields a sequence of multiscale data representations that aim to encode meaningful abstractions. It has proven effective for manifold learning, denoising, clustering, and visualization of high-dimensional data. Diffusion condensation is constructed as a time-inhomogeneous process where each step first computes and then applies a diffusion operator to the data. We theoretically analyze the convergence and evolution of this process from geometric, spectral, and topological perspectives. From a geometric perspective, we obtain convergence bounds based on the smallest transition probability and the radius of the data, whereas from a spectral perspective, our bounds are based on the eigenspectrum of the diffusion kernel. Our spectral results are of particular interest since most of the literature on data diffusion is focused on homogeneous processes. From a topological perspective, we show diffusion condensation generalizes centroid-based hierarchical clustering. We use this perspective to obtain a bound based on the number of data points, independent of their location. To understand the evolution of the data geometry beyond convergence, we use topological data analysis. We show that the condensation process itself defines an intrinsic condensation homology. We use this intrinsic topology as well as the ambient persistent homology of the condensation process to study how the data changes over diffusion time. We demonstrate both types of topological information in well-understood toy examples. Our work gives theoretical insights into the convergence of diffusion condensation, and shows that it provides a link between topological and geometric data analysis.  ( 2 min )
    Solving Collaborative Dec-POMDPs with Deep Reinforcement Learning Heuristics. (arXiv:2211.15411v4 [cs.LG] UPDATED)
    WQMIX, QMIX, QTRAN, and VDN are SOTA algorithms for Dec-POMDP. All of them cannot solve complex agents' cooperation domains. We give an algorithm to solve such problems. In the first stage, we solve a single-agent problem and get a policy. In the second stage, we solve the multi-agent problem with the single-agent policy. SA2MA has a clear advantage over all competitors in complex agents' cooperative domains.  ( 2 min )
    Exploration via Elliptical Episodic Bonuses. (arXiv:2210.05805v2 [cs.LG] UPDATED)
    In recent years, a number of reinforcement learning (RL) methods have been proposed to explore complex environments which differ across episodes. In this work, we show that the effectiveness of these methods critically relies on a count-based episodic term in their exploration bonus. As a result, despite their success in relatively simple, noise-free settings, these methods fall short in more realistic scenarios where the state space is vast and prone to noise. To address this limitation, we introduce Exploration via Elliptical Episodic Bonuses (E3B), a new method which extends count-based episodic bonuses to continuous state spaces and encourages an agent to explore states that are diverse under a learned embedding within each episode. The embedding is learned using an inverse dynamics model in order to capture controllable aspects of the environment. Our method sets a new state-of-the-art across 16 challenging tasks from the MiniHack suite, without requiring task-specific inductive biases. E3B also matches existing methods on sparse reward, pixel-based VizDoom environments, and outperforms existing methods in reward-free exploration on Habitat, demonstrating that it can scale to high-dimensional pixel-based observations and realistic environments.  ( 2 min )
    Understanding Representation Quality in Self-Supervised Models. (arXiv:2203.01881v3 [cs.LG] UPDATED)
    Self-supervised learning has shown impressive results in downstream classification tasks. However, there is limited work in understanding their failure modes and interpreting their learned representations. In this paper, we study the representation space of six state-of-the-art self-supervised models including SimCLR, SwaV, MoCo, BYOL, DINO and SimSiam. Without the use of class label information, we discover highly activating features that correspond to unique physical attributes in images and exist mostly in correctly-classified representations. Using these features, we propose Self-Supervised Representation Quality Score (or Q-Score), a model-agnostic, unsupervised score that can reliably predict if a given sample is likely to be mis-classified during linear evaluation, achieving AUPRC of 91.45 on ImageNet-100 and 78.78 on ImageNet-1K. Q-Score can also be used as a regularization term on any self-supervised model to remedy low-quality representations through the course of pre-training. We show that pre-training with Q-Score regularization can boost the performance of six state-of-the-art self-supervised models on ImageNet-1K, ImageNet-100, CIFAR-10, CIFAR-100 and STL-10, showing an average relative increase of 1.8% top-1 accuracy on linear evaluation. On ImageNet-100, BYOL shows 7.2% relative improvement and on ImageNet-1K, SimCLR shows 4.7% relative improvement compared to their baselines. Finally, using gradient heatmaps and Salient ImageNet masks, we define a metric to quantify the interpretability of each representation. We show that highly activating features are strongly correlated to core attributes and enhancing these features through Q-score regularization improves the overall representation interpretability for all self-supervised models.  ( 2 min )
    AdsorbML: Accelerating Adsorption Energy Calculations with Machine Learning. (arXiv:2211.16486v2 [cond-mat.mtrl-sci] UPDATED)
    Computational catalysis is playing an increasingly significant role in the design of catalysts across a wide range of applications. A common task for many computational methods is the need to accurately compute the minimum binding energy - the adsorption energy - for an adsorbate and a catalyst surface of interest. Traditionally, the identification of low energy adsorbate-surface configurations relies on heuristic methods and researcher intuition. As the desire to perform high-throughput screening increases, it becomes challenging to use heuristics and intuition alone. In this paper, we demonstrate machine learning potentials can be leveraged to identify low energy adsorbate-surface configurations more accurately and efficiently. Our algorithm provides a spectrum of trade-offs between accuracy and efficiency, with one balanced option finding the lowest energy configuration, within a 0.1 eV threshold, 86.33% of the time, while achieving a 1331x speedup in computation. To standardize benchmarking, we introduce the Open Catalyst Dense dataset containing nearly 1,000 diverse surfaces and 85,658 unique configurations.  ( 2 min )
    Infomaxformer: Maximum Entropy Transformer for Long Time-Series Forecasting Problem. (arXiv:2301.01772v1 [cs.LG])
    The Transformer architecture yields state-of-the-art results in many tasks such as natural language processing (NLP) and computer vision (CV), since the ability to efficiently capture the precise long-range dependency coupling between input sequences. With this advanced capability, however, the quadratic time complexity and high memory usage prevents the Transformer from dealing with long time-series forecasting problem (LTFP). To address these difficulties: (i) we revisit the learned attention patterns of the vanilla self-attention, redesigned the calculation method of self-attention based the Maximum Entropy Principle. (ii) we propose a new method to sparse the self-attention, which can prevent the loss of more important self-attention scores due to random sampling.(iii) We propose Keys/Values Distilling method motivated that a large amount of feature in the original self-attention map is redundant, which can further reduce the time and spatial complexity and make it possible to input longer time-series. Finally, we propose a method that combines the encoder-decoder architecture with seasonal-trend decomposition, i.e., using the encoder-decoder architecture to capture more specific seasonal parts. A large number of experiments on several large-scale datasets show that our Infomaxformer is obviously superior to the existing methods. We expect this to open up a new solution for Transformer to solve LTFP, and exploring the ability of the Transformer architecture to capture much longer temporal dependencies.  ( 2 min )
    On Sequential Bayesian Inference for Continual Learning. (arXiv:2301.01828v1 [cs.LG])
    Sequential Bayesian inference can be used for continual learning to prevent catastrophic forgetting of past tasks and provide an informative prior when learning new tasks. We revisit sequential Bayesian inference and test whether having access to the true posterior is guaranteed to prevent catastrophic forgetting in Bayesian neural networks. To do this we perform sequential Bayesian inference using Hamiltonian Monte Carlo. We propagate the posterior as a prior for new tasks by fitting a density estimator on Hamiltonian Monte Carlo samples. We find that this approach fails to prevent catastrophic forgetting demonstrating the difficulty in performing sequential Bayesian inference in neural networks. From there we study simple analytical examples of sequential Bayesian inference and CL and highlight the issue of model misspecification which can lead to sub-optimal continual learning performance despite exact inference. Furthermore, we discuss how task data imbalances can cause forgetting. From these limitations, we argue that we need probabilistic models of the continual learning generative process rather than relying on sequential Bayesian inference over Bayesian neural network weights. In this vein, we also propose a simple baseline called Prototypical Bayesian Continual Learning, which is competitive with state-of-the-art Bayesian continual learning methods on class incremental continual learning vision benchmarks.  ( 2 min )
    Learning Goal-Conditioned Policies Offline with Self-Supervised Reward Shaping. (arXiv:2301.02099v1 [cs.RO])
    Developing agents that can execute multiple skills by learning from pre-collected datasets is an important problem in robotics, where online interaction with the environment is extremely time-consuming. Moreover, manually designing reward functions for every single desired skill is prohibitive. Prior works targeted these challenges by learning goal-conditioned policies from offline datasets without manually specified rewards, through hindsight relabelling. These methods suffer from the issue of sparsity of rewards, and fail at long-horizon tasks. In this work, we propose a novel self-supervised learning phase on the pre-collected dataset to understand the structure and the dynamics of the model, and shape a dense reward function for learning policies offline. We evaluate our method on three continuous control tasks, and show that our model significantly outperforms existing approaches, especially on tasks that involve long-term planning.  ( 2 min )
    Understanding Hyperdimensional Computing for Parallel Single-Pass Learning. (arXiv:2202.04805v2 [cs.LG] UPDATED)
    Hyperdimensional computing (HDC) is an emerging learning paradigm that computes with high dimensional binary vectors. It is attractive because of its energy efficiency and low latency, especially on emerging hardware -- but HDC suffers from low model accuracy, with little theoretical understanding of what limits its performance. We propose a new theoretical analysis of the limits of HDC via a consideration of what similarity matrices can be "expressed" by binary vectors, and we show how the limits of HDC can be approached using random Fourier features (RFF). We extend our analysis to the more general class of vector symbolic architectures (VSA), which compute with high-dimensional vectors (hypervectors) that are not necessarily binary. We propose a new class of VSAs, finite group VSAs, which surpass the limits of HDC. Using representation theory, we characterize which similarity matrices can be "expressed" by finite group VSA hypervectors, and we show how these VSAs can be constructed. Experimental results show that our RFF method and group VSA can both outperform the state-of-the-art HDC model by up to 7.6\% while maintaining hardware efficiency.  ( 2 min )
    Robust $Q$-learning Algorithm for Markov Decision Processes under Wasserstein Uncertainty. (arXiv:2210.00898v2 [cs.LG] UPDATED)
    We present a novel $Q$-learning algorithm to solve distributionally robust Markov decision problems, where the corresponding ambiguity set of transition probabilities for the underlying Markov decision process is a Wasserstein ball around a (possibly estimated) reference measure. We prove convergence of the presented algorithm and provide several examples also using real data to illustrate both the tractability of our algorithm as well as the benefits of considering distributional robustness when solving stochastic optimal control problems, in particular when the estimated distributions turn out to be misspecified in practice.  ( 2 min )
    Robust Imitation via Mirror Descent Inverse Reinforcement Learning. (arXiv:2210.11201v2 [cs.LG] UPDATED)
    Recently, adversarial imitation learning has shown a scalable reward acquisition method for inverse reinforcement learning (IRL) problems. However, estimated reward signals often become uncertain and fail to train a reliable statistical model since the existing methods tend to solve hard optimization problems directly. Inspired by a first-order optimization method called mirror descent, this paper proposes to predict a sequence of reward functions, which are iterative solutions for a constrained convex problem. IRL solutions derived by mirror descent are tolerant to the uncertainty incurred by target density estimation since the amount of reward learning is regulated with respect to local geometric constraints. We prove that the proposed mirror descent update rule ensures robust minimization of a Bregman divergence in terms of a rigorous regret bound of $\mathcal{O}(1/T)$ for step sizes $\{\eta_t\}_{t=1}^{T}$. Our IRL method was applied on top of an adversarial framework, and it outperformed existing adversarial methods in an extensive suite of benchmarks.  ( 2 min )
    Conditional Gradients for the Approximate Vanishing Ideal. (arXiv:2202.03349v13 [cs.LG] UPDATED)
    The vanishing ideal of a set of points $X\subseteq \mathbb{R}^n$ is the set of polynomials that evaluate to $0$ over all points $\mathbf{x} \in X$ and admits an efficient representation by a finite set of polynomials called generators. To accommodate the noise in the data set, we introduce the pairwise conditional gradients approximate vanishing ideal algorithm (PCGAVI) that constructs a set of generators of the approximate vanishing ideal. The constructed generators capture polynomial structures in data and give rise to a feature map that can, for example, be used in combination with a linear classifier for supervised learning. In PCGAVI, we construct the set of generators by solving constrained convex optimization problems with the pairwise conditional gradients algorithm. Thus, PCGAVI not only constructs few but also sparse generators, making the corresponding feature transformation robust and compact. Furthermore, we derive several learning guarantees for PCGAVI that make the algorithm theoretically better motivated than related generator-constructing methods.  ( 2 min )
    Denoising Deep Generative Models. (arXiv:2212.01265v3 [cs.LG] UPDATED)
    Likelihood-based deep generative models have recently been shown to exhibit pathological behaviour under the manifold hypothesis as a consequence of using high-dimensional densities to model data with low-dimensional structure. In this paper we propose two methodologies aimed at addressing this problem. Both are based on adding Gaussian noise to the data to remove the dimensionality mismatch during training, and both provide a denoising mechanism whose goal is to sample from the model as though no noise had been added to the data. Our first approach is based on Tweedie's formula, and the second on models which take the variance of added noise as a conditional input. We show that surprisingly, while well motivated, these approaches only sporadically improve performance over not adding noise, and that other methods of addressing the dimensionality mismatch are more empirically adequate.  ( 2 min )
    Capturing cross-session neural population variability through self-supervised identification of consistent neuron ensembles. (arXiv:2205.09829v2 [q-bio.NC] UPDATED)
    Decoding stimuli or behaviour from recorded neural activity is a common approach to interrogate brain function in research, and an essential part of brain-computer and brain-machine interfaces. Reliable decoding even from small neural populations is possible because high dimensional neural population activity typically occupies low dimensional manifolds that are discoverable with suitable latent variable models. Over time however, drifts in activity of individual neurons and instabilities in neural recording devices can be substantial, making stable decoding over days and weeks impractical. While this drift cannot be predicted on an individual neuron level, population level variations over consecutive recording sessions such as differing sets of neurons and varying permutations of consistent neurons in recorded data may be learnable when the underlying manifold is stable over time. Classification of consistent versus unfamiliar neurons across sessions and accounting for deviations in the order of consistent recording neurons in recording datasets over sessions of recordings may then maintain decoding performance. In this work we show that self-supervised training of a deep neural network can be used to compensate for this inter-session variability. As a result, a sequential autoencoding model can maintain state-of-the-art behaviour decoding performance for completely unseen recording sessions several days into the future. Our approach only requires a single recording session for training the model, and is a step towards reliable, recalibration-free brain computer interfaces.  ( 2 min )
    FF-NSL: Feed-Forward Neural-Symbolic Learner. (arXiv:2106.13103v3 [cs.LG] UPDATED)
    Logic-based machine learning aims to learn general, interpretable knowledge in a data-efficient manner. However, labelled data must be specified in a structured logical form. To address this limitation, we propose a neural-symbolic learning framework, called Feed-Forward Neural-Symbolic Learner (FFNSL), that integrates a logic-based machine learning system capable of learning from noisy examples, with neural networks, in order to learn interpretable knowledge from labelled unstructured data. We demonstrate the generality of FFNSL on four neural-symbolic classification problems, where different pre-trained neural network models and logic-based machine learning systems are integrated to learn interpretable knowledge from sequences of images. We evaluate the robustness of our framework by using images subject to distributional shifts, for which the pre-trained neural networks may predict incorrectly and with high confidence. We analyse the impact that these shifts have on the accuracy of the learned knowledge and run-time performance, comparing FFNSL to tree-based and pure neural approaches. Our experimental results show that FFNSL outperforms the baselines by learning more accurate and interpretable knowledge with fewer examples.  ( 2 min )
    LieGG: Studying Learned Lie Group Generators. (arXiv:2210.04345v2 [cs.LG] UPDATED)
    Symmetries built into a neural network have appeared to be very beneficial for a wide range of tasks as it saves the data to learn them. We depart from the position that when symmetries are not built into a model a priori, it is advantageous for robust networks to learn symmetries directly from the data to fit a task function. In this paper, we present a method to extract symmetries learned by a neural network and to evaluate the degree to which a network is invariant to them. With our method, we are able to explicitly retrieve learned invariances in a form of the generators of corresponding Lie-groups without prior knowledge of symmetries in the data. We use the proposed method to study how symmetrical properties depend on a neural network's parameterization and configuration. We found that the ability of a network to learn symmetries generalizes over a range of architectures. However, the quality of learned symmetries depends on the depth and the number of parameters.  ( 2 min )
    Global Weighted Tensor Nuclear Norm for Tensor Robust Principal Component Analysis. (arXiv:2209.14084v2 [cs.LG] UPDATED)
    Tensor Robust Principal Component Analysis (TRPCA), which aims to recover a low-rank tensor corrupted by sparse noise, has attracted much attention in many real applications. This paper develops a new Global Weighted TRPCA method (GWTRPCA), which is the first approach simultaneously considers the significance of intra-frontal slice and inter-frontal slice singular values in the Fourier domain. Exploiting this global information, GWTRPCA penalizes the larger singular values less and assigns smaller weights to them. Hence, our method can recover the low-tubal-rank components more exactly. Moreover, we propose an effective adaptive weight learning strategy by a Modified Cauchy Estimator (MCE) since the weight setting plays a crucial role in the success of GWTRPCA. To implement the GWTRPCA method, we devise an optimization algorithm using an Alternating Direction Method of Multipliers (ADMM) method. Experiments on real-world datasets validate the effectiveness of our proposed method.  ( 2 min )
    Random forests, sound symbolism and Pokemon evolution. (arXiv:2301.01948v1 [cs.LG])
    This study constructs machine learning algorithms that are trained to classify samples using sound symbolism, and then it reports on an experiment designed to measure their understanding against human participants. Random forests are trained using the names of Pokemon, which are fictional video game characters, and their evolutionary status. Pokemon undergo evolution when certain in-game conditions are met. Evolution changes the appearance, abilities, and names of Pokemon. In the first experiment, we train three random forests using the sounds that make up the names of Japanese, Chinese, and Korean Pokemon to classify Pokemon into pre-evolution and post-evolution categories. We then train a fourth random forest using the results of an elicitation experiment whereby Japanese participants named previously unseen Pokemon. In Experiment 2, we reproduce those random forests with name length as a feature and compare the performance of the random forests against humans in a classification experiment whereby Japanese participants classified the names elicited in Experiment 1 into pre-and post-evolution categories. Experiment 2 reveals an issue pertaining to overfitting in Experiment 1 which we resolve using a novel cross-validation method. The results show that the random forests are efficient learners of systematic sound-meaning correspondence patterns and can classify samples with greater accuracy than the human participants.  ( 2 min )
    Optimal lower bounds for Quantum Learning via Information Theory. (arXiv:2301.02227v1 [quant-ph])
    Although a concept class may be learnt more efficiently using quantum samples as compared with classical samples in certain scenarios, Arunachalam and de Wolf (JMLR, 2018) proved that quantum learners are asymptotically no more efficient than classical ones in the quantum PAC and Agnostic learning models. They established lower bounds on sample complexity via quantum state identification and Fourier analysis. In this paper, we derive optimal lower bounds for quantum sample complexity in both the PAC and agnostic models via an information-theoretic approach. The proofs are arguably simpler, and the same ideas can potentially be used to derive optimal bounds for other problems in quantum learning theory. We then turn to a quantum analogue of the Coupon Collector problem, a classic problem from probability theory also of importance in the study of PAC learning. Arunachalam, Belovs, Childs, Kothari, Rosmanis, and de Wolf (TQC, 2020) characterized the quantum sample complexity of this problem up to constant factors. First, we show that the information-theoretic approach mentioned above provably does not yield the optimal lower bound. As a by-product, we get a natural ensemble of pure states in arbitrarily high dimensions which are not easily (simultaneously) distinguishable, while the ensemble has close to maximal Holevo information. Second, we discover that the information-theoretic approach yields an asymptotically optimal bound for an approximation variant of the problem. Finally, we derive a sharp lower bound for the Quantum Coupon Collector problem, with the exact leading order term, via the Holevo-Curlander bounds on the distinguishability of an ensemble. All the aspects of the Quantum Coupon Collector problem we study rest on properties of the spectrum of the associated Gram matrix, which may be of independent interest.  ( 3 min )
    WPPNets and WPPFlows: The Power of Wasserstein Patch Priors for Superresolution. (arXiv:2201.08157v3 [cs.CV] UPDATED)
    Exploiting image patches instead of whole images have proved to be a powerful approach to tackle various problems in image processing. Recently, Wasserstein patch priors (WPP), which are based on the comparison of the patch distributions of the unknown image and a reference image, were successfully used as data-driven regularizers in the variational formulation of superresolution. However, for each input image, this approach requires the solution of a non-convex minimization problem which is computationally costly. In this paper, we propose to learn two kind of neural networks in an unsupervised way based on WPP loss functions. First, we show how convolutional neural networks (CNNs) can be incorporated. Once the network, called WPPNet, is learned, it can be very efficiently applied to any input image. Second, we incorporate conditional normalizing flows to provide a tool for uncertainty quantification. Numerical examples demonstrate the very good performance of WPPNets for superresolution in various image classes even if the forward operator is known only approximately.  ( 2 min )
    Grape Cold Hardiness Prediction via Multi-Task Learning. (arXiv:2209.10585v4 [cs.LG] UPDATED)
    Cold temperatures during fall and spring have the potential to cause frost damage to grapevines and other fruit plants, which can significantly decrease harvest yields. To help prevent these losses, farmers deploy expensive frost mitigation measures such as sprinklers, heaters, and wind machines when they judge that damage may occur. This judgment, however, is challenging because the cold hardiness of plants changes throughout the dormancy period and it is difficult to directly measure. This has led scientists to develop cold hardiness prediction models that can be tuned to different grape cultivars based on laborious field measurement data. In this paper, we study whether deep learning models can improve cold hardiness prediction for grapes based on data that has been collected over a 30-year time period. A key challenge is that the amount of data per cultivar is highly variable, with some cultivars having only a small amount. For this purpose, we investigate the use of multi-task learning to leverage data across cultivars in order to improve prediction performance for individual cultivars. We evaluate a number of multi-task learning approaches and show that the highest performing approach is able to significantly improve over learning for single cultivars and outperforms the current state-of-the-art scientific model for most cultivars.  ( 2 min )
    Symmetry Teleportation for Accelerated Optimization. (arXiv:2205.10637v3 [cs.LG] UPDATED)
    Existing gradient-based optimization methods update parameters locally, in a direction that minimizes the loss function. We study a different approach, symmetry teleportation, that allows parameters to travel a large distance on the loss level set, in order to improve the convergence speed in subsequent steps. Teleportation exploits symmetries in the loss landscape of optimization problems. We derive loss-invariant group actions for test functions in optimization and multi-layer neural networks, and prove a necessary condition for teleportation to improve convergence rate. We also show that our algorithm is closely related to second order methods. Experimentally, we show that teleportation improves the convergence speed of gradient descent and AdaGrad for several optimization problems including test functions, multi-layer regressions, and MNIST classification.  ( 2 min )
    I'm Me, We're Us, and I'm Us: Tri-directional Contrastive Learning on Hypergraphs. (arXiv:2206.04739v4 [cs.LG] UPDATED)
    Although machine learning on hypergraphs has attracted considerable attention, most of the works have focused on (semi-)supervised learning, which may cause heavy labeling costs and poor generalization. Recently, contrastive learning has emerged as a successful unsupervised representation learning method. Despite the prosperous development of contrastive learning in other domains, contrastive learning on hypergraphs remains little explored. In this paper, we propose TriCL (Tri-directional Contrastive Learning), a general framework for contrastive learning on hypergraphs. Its main idea is tri-directional contrast, and specifically, it aims to maximize in two augmented views the agreement (a) between the same node, (b) between the same group of nodes, and (c) between each group and its members. Together with simple but surprisingly effective data augmentation and negative sampling schemes, these three forms of contrast enable TriCL to capture both microscopic and mesoscopic structural information in node embeddings. Our extensive experiments using 13 baseline approaches, five datasets, and two tasks demonstrate the effectiveness of TriCL, and most noticeably, TriCL consistently outperforms not just unsupervised competitors but also (semi-)supervised competitors mostly by significant margins for node classification. The code and datasets are available at https://github.com/wooner49/TriCL.  ( 2 min )
    Verifying Inverse Model Neural Networks. (arXiv:2202.02429v2 [cs.LG] UPDATED)
    Inverse problems exist in a wide variety of physical domains from aerospace engineering to medical imaging. The goal is to infer the underlying state from a set of observations. When the forward model that produced the observations is nonlinear and stochastic, solving the inverse problem is very challenging. Neural networks are an appealing solution for solving inverse problems as they can be trained from noisy data and once trained are computationally efficient to run. However, inverse model neural networks do not have guarantees of correctness built-in, which makes them unreliable for use in safety and accuracy-critical contexts. In this work we introduce a method for verifying the correctness of inverse model neural networks. Our approach is to overapproximate a nonlinear, stochastic forward model with piecewise linear constraints and encode both the overapproximate forward model and the neural network inverse model as a mixed-integer program. We demonstrate this verification procedure on a real-world airplane fuel gauge case study. The ability to verify and consequently trust inverse model neural networks allows their use in a wide variety of contexts, from aerospace to medicine.  ( 2 min )
    Interpretable Learned Emergent Communication for Human-Agent Teams. (arXiv:2201.07452v2 [cs.LG] UPDATED)
    Learning interpretable communication is essential for multi-agent and human-agent teams (HATs). In multi-agent reinforcement learning for partially-observable environments, agents may convey information to others via learned communication, allowing the team to complete its task. Inspired by human languages, recent works study discrete (using only a finite set of tokens) and sparse (communicating only at some time-steps) communication. However, the utility of such communication in human-agent team experiments has not yet been investigated. In this work, we analyze the efficacy of sparse-discrete methods for producing emergent communication that enables high agent-only and human-agent team performance. We develop agent-only teams that communicate sparsely via our scheme of Enforcers that sufficiently constrain communication to any budget. Our results show no loss or minimal loss of performance in benchmark environments and tasks. In human-agent teams tested in benchmark environments, where agents have been modeled using the Enforcers, we find that a prototype-based method produces meaningful discrete tokens that enable human partners to learn agent communication faster and better than a one-hot baseline. Additional HAT experiments show that an appropriate sparsity level lowers the cognitive load of humans when communicating with teams of agents and leads to superior team performance.  ( 2 min )
    A general framework for implementing distances for categorical variables. (arXiv:2301.02190v1 [stat.ML])
    The degree to which subjects differ from each other with respect to certain properties measured by a set of variables, plays an important role in many statistical methods. For example, classification, clustering, and data visualization methods all require a quantification of differences in the observed values. We can refer to the quantification of such differences, as distance. An appropriate definition of a distance depends on the nature of the data and the problem at hand. For distances between numerical variables, there exist many definitions that depend on the size of the observed differences. For categorical data, the definition of a distance is more complex, as there is no straightforward quantification of the size of the observed differences. Consequently, many proposals exist that can be used to measure differences based on categorical variables. In this paper, we introduce a general framework that allows for an efficient and transparent implementation of distances between observations on categorical variables. We show that several existing distances can be incorporated into the framework. Moreover, our framework quite naturally leads to the introduction of new distance formulations and allows for the implementation of flexible, case and data specific distance definitions. Furthermore, in a supervised classification setting, the framework can be used to construct distances that incorporate the association between the response and predictor variables and hence improve the performance of distance-based classifiers.  ( 2 min )
    UDC: Unified DNAS for Compressible TinyML Models. (arXiv:2201.05842v4 [cs.LG] UPDATED)
    Deploying TinyML models on low-cost IoT hardware is very challenging, due to limited device memory capacity. Neural processing unit (NPU) hardware address the memory challenge by using model compression to exploit weight quantization and sparsity to fit more parameters in the same footprint. However, designing compressible neural networks (NNs) is challenging, as it expands the design space across which we must make balanced trade-offs. This paper demonstrates Unified DNAS for Compressible (UDC) NNs, which explores a large search space to generate state-of-the-art compressible NNs for NPU. ImageNet results show UDC networks are up to $3.35\times$ smaller (iso-accuracy) or 6.25% more accurate (iso-model size) than previous work.  ( 2 min )
    L-HYDRA: Multi-Head Physics-Informed Neural Networks. (arXiv:2301.02152v1 [cs.LG])
    We introduce multi-head neural networks (MH-NNs) to physics-informed machine learning, which is a type of neural networks (NNs) with all nonlinear hidden layers as the body and multiple linear output layers as multi-head. Hence, we construct multi-head physics-informed neural networks (MH-PINNs) as a potent tool for multi-task learning (MTL), generative modeling, and few-shot learning for diverse problems in scientific machine learning (SciML). MH-PINNs connect multiple functions/tasks via a shared body as the basis functions as well as a shared distribution for the head. The former is accomplished by solving multiple tasks with MH-PINNs with each head independently corresponding to each task, while the latter by employing normalizing flows (NFs) for density estimate and generative modeling. To this end, our method is a two-stage method, and both stages can be tackled with standard deep learning tools of NNs, enabling easy implementation in practice. MH-PINNs can be used for various purposes, such as approximating stochastic processes, solving multiple tasks synergistically, providing informative prior knowledge for downstream few-shot learning tasks such as meta-learning and transfer learning, learning representative basis functions, and uncertainty quantification. We demonstrate the effectiveness of MH-PINNs in five benchmarks, investigating also the possibility of synergistic learning in regression analysis. We name the open-source code "Lernaean Hydra" (L-HYDRA), since this mythical creature possessed many heads for performing important multiple tasks, as in the proposed method.  ( 2 min )
    Anaphora Resolution in Dialogue: System Description (CODI-CRAC 2022 Shared Task). (arXiv:2301.02113v1 [cs.CL])
    We describe three models submitted for the CODI-CRAC 2022 shared task. To perform identity anaphora resolution, we test several combinations of the incremental clustering approach based on the Workspace Coreference System (WCS) with other coreference models. The best result is achieved by adding the ''cluster merging'' version of the coref-hoi model, which brings up to 10.33% improvement 1 over vanilla WCS clustering. Discourse deixis resolution is implemented as multi-task learning: we combine the learning objective of corefhoi with anaphor type classification. We adapt the higher-order resolution model introduced in Joshi et al. (2019) for bridging resolution given gold mentions and anaphors.  ( 2 min )
    Beyond spectral gap (extended): The role of the topology in decentralized learning. (arXiv:2301.02151v1 [cs.LG])
    In data-parallel optimization of machine learning models, workers collaborate to improve their estimates of the model: more accurate gradients allow them to use larger learning rates and optimize faster. In the decentralized setting, in which workers communicate over a sparse graph, current theory fails to capture important aspects of real-world behavior. First, the `spectral gap' of the communication graph is not predictive of its empirical performance in (deep) learning. Second, current theory does not explain that collaboration enables larger learning rates than training alone. In fact, it prescribes smaller learning rates, which further decrease as graphs become larger, failing to explain convergence dynamics in infinite graphs. This paper aims to paint an accurate picture of sparsely-connected distributed optimization. We quantify how the graph topology influences convergence in a quadratic toy problem and provide theoretical results for general smooth and (strongly) convex objectives. Our theory matches empirical observations in deep learning, and accurately describes the relative merits of different graph topologies. This paper is an extension of the conference paper by Vogels et. al. (2022). Code: https://github.com/epfml/topology-in-decentralized-learning.  ( 2 min )
    Solving Multistage Stochastic Linear Programming via Regularized Linear Decision Rules: An Application to Hydrothermal Dispatch Planning. (arXiv:2110.03146v3 [math.OC] UPDATED)
    The solution of multistage stochastic linear problems (MSLP) represents a challenge for many application areas. Long-term hydrothermal dispatch planning (LHDP) materializes this challenge in a real-world problem that affects electricity markets, economies, and natural resources worldwide. No closed-form solutions are available for MSLP and the definition of non-anticipative policies with high-quality out-of-sample performance is crucial. Linear decision rules (LDR) provide an interesting simulation-based framework for finding high-quality policies for MSLP through two-stage stochastic models. In practical applications, however, the number of parameters to be estimated when using an LDR may be close to or higher than the number of scenarios of the sample average approximation problem, thereby generating an in-sample overfit and poor performances in out-of-sample simulations. In this paper, we propose a novel regularized LDR to solve MSLP based on the AdaLASSO (adaptive least absolute shrinkage and selection operator). The goal is to use the parsimony principle, as largely studied in high-dimensional linear regression models, to obtain better out-of-sample performance for LDR applied to MSLP. Computational experiments show that the overfit threat is non-negligible when using classical non-regularized LDR to solve the LHDP, one of the most studied MSLP with relevant applications. Our analysis highlights the following benefits of the proposed framework in comparison to the non-regularized benchmark: 1) significant reductions in the number of non-zero coefficients (model parsimony), 2) substantial cost reductions in out-of-sample evaluations, and 3) improved spot-price profiles.  ( 3 min )
    Semantic match: Debugging feature attribution methods in XAI for healthcare. (arXiv:2301.02080v1 [cs.AI])
    The recent spike in certified Artificial Intelligence (AI) tools for healthcare has renewed the debate around adoption of this technology. One thread of such debate concerns Explainable AI and its promise to render AI devices more transparent and trustworthy. A few voices active in the medical AI space have expressed concerns on the reliability of Explainable AI techniques and especially feature attribution methods, questioning their use and inclusion in guidelines and standards. Despite valid concerns, we argue that existing criticism on the viability of post-hoc local explainability methods throws away the baby with the bathwater by generalizing a problem that is specific to image data. We begin by characterizing the problem as a lack of semantic match between explanations and human understanding. To understand when feature importance can be used reliably, we introduce a distinction between feature importance of low- and high-level features. We argue that for data types where low-level features come endowed with a clear semantics, such as tabular data like Electronic Health Records (EHRs), semantic match can be obtained, and thus feature attribution methods can still be employed in a meaningful and useful way.  ( 2 min )
    A Distance-Geometric Method for Recovering Robot Joint Angles From an RGB Image. (arXiv:2301.02051v1 [cs.RO])
    Autonomous manipulation systems operating in domains where human intervention is difficult or impossible (e.g., underwater, extraterrestrial or hazardous environments) require a high degree of robustness to sensing and communication failures. Crucially, motion planning and control algorithms require a stream of accurate joint angle data provided by joint encoders, the failure of which may result in an unrecoverable loss of functionality. In this paper, we present a novel method for retrieving the joint angles of a robot manipulator using only a single RGB image of its current configuration, opening up an avenue for recovering system functionality when conventional proprioceptive sensing is unavailable. Our approach, based on a distance-geometric representation of the configuration space, exploits the knowledge of a robot's kinematic model with the goal of training a shallow neural network that performs a 2D-to-3D regression of distances associated with detected structural keypoints. It is shown that the resulting Euclidean distance matrix uniquely corresponds to the observed configuration, where joint angles can be recovered via multidimensional scaling and a simple inverse kinematics procedure. We evaluate the performance of our approach on real RGB images of a Franka Emika Panda manipulator, showing that the proposed method is efficient and exhibits solid generalization ability. Furthermore, we show that our method can be easily combined with a dense refinement technique to obtain superior results.  ( 2 min )
    CA$^2$T-Net: Category-Agnostic 3D Articulation Transfer from Single Image. (arXiv:2301.02232v1 [cs.CV])
    We present a neural network approach to transfer the motion from a single image of an articulated object to a rest-state (i.e., unarticulated) 3D model. Our network learns to predict the object's pose, part segmentation, and corresponding motion parameters to reproduce the articulation shown in the input image. The network is composed of three distinct branches that take a shared joint image-shape embedding and is trained end-to-end. Unlike previous methods, our approach is independent of the topology of the object and can work with objects from arbitrary categories. Our method, trained with only synthetic data, can be used to automatically animate a mesh, infer motion from real images, and transfer articulation to functionally similar but geometrically distinct 3D models at test time.  ( 2 min )
    Value Enhancement of Reinforcement Learning via Efficient and Robust Trust Region Optimization. (arXiv:2301.02220v1 [stat.ML])
    Reinforcement learning (RL) is a powerful machine learning technique that enables an intelligent agent to learn an optimal policy that maximizes the cumulative rewards in sequential decision making. Most of methods in the existing literature are developed in \textit{online} settings where the data are easy to collect or simulate. Motivated by high stake domains such as mobile health studies with limited and pre-collected data, in this paper, we study \textit{offline} reinforcement learning methods. To efficiently use these datasets for policy optimization, we propose a novel value enhancement method to improve the performance of a given initial policy computed by existing state-of-the-art RL algorithms. Specifically, when the initial policy is not consistent, our method will output a policy whose value is no worse and often better than that of the initial policy. When the initial policy is consistent, under some mild conditions, our method will yield a policy whose value converges to the optimal one at a faster rate than the initial policy, achieving the desired ``value enhancement" property. The proposed method is generally applicable to any parametrized policy that belongs to certain pre-specified function class (e.g., deep neural networks). Extensive numerical studies are conducted to demonstrate the superior performance of our method.  ( 2 min )
    Self-Motivated Multi-Agent Exploration. (arXiv:2301.02083v1 [cs.LG])
    In cooperative multi-agent reinforcement learning (CMARL), it is critical for agents to achieve a balance between self-exploration and team collaboration. However, agents can hardly accomplish the team task without coordination and they would be trapped in a local optimum where easy cooperation is accessed without enough individual exploration. Recent works mainly concentrate on agents' coordinated exploration, which brings about the exponentially grown exploration of the state space. To address this issue, we propose Self-Motivated Multi-Agent Exploration (SMMAE), which aims to achieve success in team tasks by adaptively finding a trade-off between self-exploration and team cooperation. In SMMAE, we train an independent exploration policy for each agent to maximize their own visited state space. Each agent learns an adjustable exploration probability based on the stability of the joint team policy. The experiments on highly cooperative tasks in StarCraft II micromanagement benchmark (SMAC) demonstrate that SMMAE can explore task-related states more efficiently, accomplish coordinated behaviours and boost the learning performance.  ( 2 min )
    Reprogramming Pretrained Language Models for Protein Sequence Representation Learning. (arXiv:2301.02120v1 [cs.LG])
    Machine Learning-guided solutions for protein learning tasks have made significant headway in recent years. However, success in scientific discovery tasks is limited by the accessibility of well-defined and labeled in-domain data. To tackle the low-data constraint, recent adaptions of deep learning models pretrained on millions of protein sequences have shown promise; however, the construction of such domain-specific large-scale model is computationally expensive. Here, we propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework in which we reprogram deep models for alternate-domain tasks that can perform well on protein property prediction with significantly fewer training samples. R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences, by learning a sparse linear mapping between English and protein sequence vocabulary embeddings. Our model can attain better accuracy and significantly improve the data efficiency by up to $10^5$ times over the baselines set by pretrained and standard supervised methods. To this end, we reprogram an off-the-shelf pre-trained English language transformer and benchmark it on a set of protein physicochemical prediction tasks (secondary structure, stability, homology, stability) as well as on a biomedically relevant set of protein function prediction tasks (antimicrobial, toxicity, antibody affinity).  ( 2 min )
    Critical Perspectives: A Benchmark Revealing Pitfalls in PerspectiveAPI. (arXiv:2301.01874v1 [cs.CL])
    Detecting "toxic" language in internet content is a pressing social and technical challenge. In this work, we focus on PERSPECTIVE from Jigsaw, a state-of-the-art tool that promises to score the "toxicity" of text, with a recent model update that claims impressive results (Lees et al., 2022). We seek to challenge certain normative claims about toxic language by proposing a new benchmark, Selected Adversarial SemanticS, or SASS. We evaluate PERSPECTIVE on SASS, and compare to low-effort alternatives, like zero-shot and few-shot GPT-3 prompt models, in binary classification settings. We find that PERSPECTIVE exhibits troubling shortcomings across a number of our toxicity categories. SASS provides a new tool for evaluating performance on previously undetected toxic language that avoids common normative pitfalls. Our work leads us to emphasize the importance of questioning assumptions made by tools already in deployment for toxicity detection in order to anticipate and prevent disparate harms.  ( 2 min )
    Instance-based Explanations for Gradient Boosting Machine Predictions with AXIL Weights. (arXiv:2301.01864v1 [cs.LG])
    We show that regression predictions from linear and tree-based models can be represented as linear combinations of target instances in the training data. This also holds for models constructed as ensembles of trees, including Random Forests and Gradient Boosting Machines. The weights used in these linear combinations are measures of instance importance, complementing existing measures of feature importance, such as SHAP and LIME. We refer to these measures as AXIL weights (Additive eXplanations with Instance Loadings). Since AXIL weights are additive across instances, they offer both local and global explanations. Our work contributes to the broader effort to make machine learning predictions more interpretable and explainable.  ( 2 min )
    PMP: Privacy-Aware Matrix Profile against Sensitive Pattern Inference for Time Series. (arXiv:2301.01838v1 [cs.LG])
    Recent rapid development of sensor technology has allowed massive fine-grained time series (TS) data to be collected and set the foundation for the development of data-driven services and applications. During the process, data sharing is often involved to allow the third-party modelers to perform specific time series data mining (TSDM) tasks based on the need of data owner. The high resolution of TS brings new challenges in protecting privacy. While meaningful information in high-resolution TS shifts from concrete point values to local shape-based segments, numerous research have found that long shape-based patterns could contain more sensitive information and may potentially be extracted and misused by a malicious third party. However, the privacy issue for TS patterns is surprisingly seldom explored in privacy-preserving literature. In this work, we consider a new privacy-preserving problem: preventing malicious inference on long shape-based patterns while preserving short segment information for the utility task performance. To mitigate the challenge, we investigate an alternative approach by sharing Matrix Profile (MP), which is a non-linear transformation of original data and a versatile data structure that supports many data mining tasks. We found that while MP can prevent concrete shape leakage, the canonical correlation in MP index can still reveal the location of sensitive long pattern. Based on this observation, we design two attacks named Location Attack and Entropy Attack to extract the pattern location from MP. To further protect MP from these two attacks, we propose a Privacy-Aware Matrix Profile (PMP) via perturbing the local correlation and breaking the canonical correlation in MP index vector. We evaluate our proposed PMP against baseline noise-adding methods through quantitative analysis and real-world case studies to show the effectiveness of the proposed method.  ( 2 min )
    Multi-Task Learning for Budbreak Prediction. (arXiv:2301.01815v1 [cs.LG])
    Grapevine budbreak is a key phenological stage of seasonal development, which serves as a signal for the onset of active growth. This is also when grape plants are most vulnerable to damage from freezing temperatures. Hence, it is important for winegrowers to anticipate the day of budbreak occurrence to protect their vineyards from late spring frost events. This work investigates deep learning for budbreak prediction using data collected for multiple grape cultivars. While some cultivars have over 30 seasons of data others have as little as 4 seasons, which can adversely impact prediction accuracy. To address this issue, we investigate multi-task learning, which combines data across all cultivars to make predictions for individual cultivars. Our main result shows that several variants of multi-task learning are all able to significantly improve prediction accuracy compared to learning for each cultivar independently.  ( 2 min )
    Towards Long-Term Time-Series Forecasting: Feature, Pattern, and Distribution. (arXiv:2301.02068v1 [cs.LG])
    Long-term time-series forecasting (LTTF) has become a pressing demand in many applications, such as wind power supply planning. Transformer models have been adopted to deliver high prediction capacity because of the high computational self-attention mechanism. Though one could lower the complexity of Transformers by inducing the sparsity in point-wise self-attentions for LTTF, the limited information utilization prohibits the model from exploring the complex dependencies comprehensively. To this end, we propose an efficient Transformerbased model, named Conformer, which differentiates itself from existing methods for LTTF in three aspects: (i) an encoder-decoder architecture incorporating a linear complexity without sacrificing information utilization is proposed on top of sliding-window attention and Stationary and Instant Recurrent Network (SIRN); (ii) a module derived from the normalizing flow is devised to further improve the information utilization by inferring the outputs with the latent variables in SIRN directly; (iii) the inter-series correlation and temporal dynamics in time-series data are modeled explicitly to fuel the downstream self-attention mechanism. Extensive experiments on seven real-world datasets demonstrate that Conformer outperforms the state-of-the-art methods on LTTF and generates reliable prediction results with uncertainty quantification.  ( 2 min )
    A deep learning approach to using wearable seismocardiography (SCG) for diagnosing aortic valve stenosis and predicting aortic hemodynamics obtained by 4D flow MRI. (arXiv:2301.02130v1 [cs.LG])
    In this paper, we explored the use of deep learning for the prediction of aortic flow metrics obtained using 4D flow MRI using wearable seismocardiography (SCG) devices. 4D flow MRI provides a comprehensive assessment of cardiovascular hemodynamics, but it is costly and time-consuming. We hypothesized that deep learning could be used to identify pathological changes in blood flow, such as elevated peak systolic velocity Vmax in patients with heart valve diseases, from SCG signals. We also investigated the ability of this deep learning technique to differentiate between patients diagnosed with aortic valve stenosis (AS), non-AS patients with a bicuspid aortic valve (BAV), non-AS patients with a mechanical aortic valve (MAV), and healthy subjects with a normal tricuspid aortic valve (TAV). In a study of 77 subjects who underwent same-day 4D flow MRI and SCG, we found that the Vmax values obtained using deep learning and SCGs were in good agreement with those obtained by 4D flow MRI. Additionally, subjects with TAV, BAV, MAV, and AS could be classified with ROC-AUC values of 92%, 95%, 81%, and 83%, respectively. This suggests that SCG obtained using low-cost wearable electronics may be used as a supplement to 4D flow MRI exams or as a screening tool for aortic valve disease.  ( 2 min )
    Plant species richness prediction from DESIS hyperspectral data: A comparison study on feature extraction procedures and regression models. (arXiv:2301.01918v1 [cs.LG])
    The diversity of terrestrial vascular plants plays a key role in maintaining the stability and productivity of ecosystems. Monitoring species compositional diversity across large spatial scales is challenging and time consuming. The advanced spectral and spatial specification of the recently launched DESIS (the DLR Earth Sensing Imaging Spectrometer) instrument provides a unique opportunity to test the potential for monitoring plant species diversity with spaceborne hyperspectral data. This study provides a quantitative assessment on the ability of DESIS hyperspectral data for predicting plant species richness in two different habitat types in southeast Australia. Spectral features were first extracted from the DESIS spectra, then regressed against on-ground estimates of plant species richness, with a two-fold cross validation scheme to assess the predictive performance. We tested and compared the effectiveness of Principal Component Analysis (PCA), Canonical Correlation Analysis (CCA), and Partial Least Squares analysis (PLS) for feature extraction, and Kernel Ridge Regression (KRR), Gaussian Process Regression (GPR), Random Forest Regression (RFR) for species richness prediction. The best prediction results were r=0.76 and RMSE=5.89 for the Southern Tablelands region, and r=0.68 and RMSE=5.95 for the Snowy Mountains region. Relative importance analysis for the DESIS spectral bands showed that the red-edge, red, and blue spectral regions were more important for predicting plant species richness than the green bands and the near-infrared bands beyond red-edge. We also found that the DESIS hyperspectral data performed better than Sentinel-2 multispectral data in the prediction of plant species richness. Our results provide a quantitative reference for future studies exploring the potential of spaceborne hyperspectral data for plant biodiversity mapping.  ( 2 min )
    MessageNet: Message Classification using Natural Language Processing and Meta-data. (arXiv:2301.01808v1 [cs.LG])
    In this paper we propose a new Deep Learning (DL) approach for message classification. Our method is based on the state-of-the-art Natural Language Processing (NLP) building blocks, combined with a novel technique for infusing the meta-data input that is typically available in messages such as the sender information, timestamps, attached image, audio, affiliations, and more. As we demonstrate throughout the paper, going beyond the mere text by leveraging all available channels in the message, could yield an improved representation and higher classification accuracy. To achieve message representation, each type of input is processed in a dedicated block in the neural network architecture that is suitable for the data type. Such an implementation enables training all blocks together simultaneously, and forming cross channels features in the network. We show in the Experiments Section that in some cases, message's meta-data holds an additional information that cannot be extracted just from the text, and when using this information we achieve better performance. Furthermore, we demonstrate that our multi-modality block approach outperforms other approaches for injecting the meta data to the the text classifier.  ( 2 min )
    Privacy and Efficiency of Communications in Federated Split Learning. (arXiv:2301.01824v1 [cs.LG])
    Everyday, large amounts of sensitive data \sai{is} distributed across mobile phones, wearable devices, and other sensors. Traditionally, these enormous datasets have been processed on a single system, with complex models being trained to make valuable predictions. Distributed machine learning techniques such as Federated and Split Learning have recently been developed to protect user \sai{data and} privacy better while ensuring high performance. Both of these distributed learning architectures have advantages and disadvantages. In this paper, we examine these tradeoffs and suggest a new hybrid Federated Split Learning architecture that combines the efficiency and privacy benefits of both. Our evaluation demonstrates how our hybrid Federated Split Learning approach can lower the amount of processing power required by each client running a distributed learning system, reduce training and inference time while keeping a similar accuracy. We also discuss the resiliency of our approach to deep learning privacy inference attacks and compare our solution to other recently proposed benchmarks.  ( 2 min )
    Fragment-based t-SMILES for de novo molecular generation. (arXiv:2301.01829v1 [cs.LG])
    At present, sequence-based and graph-based models are two of popular used molecular generative models. In this study, we introduce a general-purposed, fragment-based, hierarchical molecular representation named t-SMILES (tree-based SMILES) which describes molecules using a SMILES-type string obtained by doing breadth first search (BFS) on full binary molecular tree formed from fragmented molecular graph. The proposed t-SMILES combines the advantages of graph model paying more attention to molecular topology structure and language model possessing powerful learning ability. Experiments with feature tree rooted JTVAE and chemical reaction-based BRICS molecular decomposing algorithms using sequence-based autoregressive generation models on three popular molecule datasets including Zinc, QM9 and ChEMBL datasets indicate that t-SMILES based models significantly outperform previously proposed fragment-based models and being competitive with classical SMILES based and graph-based approaches. Most importantly, we proposed a new perspective for fragment based molecular designing. Hence, SOTA powerful sequence-based solutions could be easily applied for fragment based molecular tasks.  ( 2 min )
    Bayesian Weapon System Reliability Modeling with Cox-Weibull Neural Network. (arXiv:2301.01850v1 [stat.AP])
    We propose to integrate weapon system features (such as weapon system manufacturer, deployment time and location, storage time and location, etc.) into a parameterized Cox-Weibull reliability model via a neural network, like DeepSurv, to improve predictive maintenance. In parallel, we develop an alternative Bayesian model by parameterizing the Weibull parameters with a neural network and employing dropout methods such as Monte-Carlo (MC)-dropout for comparative purposes. Due to data collection procedures in weapon system testing we employ a novel interval-censored log-likelihood which incorporates Monte-Carlo Markov Chain (MCMC) sampling of the Weibull parameters during gradient descent optimization. We compare classification metrics such as receiver operator curve (ROC), area under the curve (AUC), and F scores and show that our model generally outperforms traditional powerful models such as XGBoost as well as the current standard conditional Weibull probability density estimation model.  ( 2 min )
    A first-order augmented Lagrangian method for constrained minimax optimization. (arXiv:2301.02060v1 [math.OC])
    In this paper we study a class of constrained minimax problems. In particular, we propose a first-order augmented Lagrangian method for solving them, whose subproblems turn out to be a much simpler structured minimax problem and are suitably solved by a first-order method recently developed in [26] by the authors. Under some suitable assumptions, an \emph{operation complexity} of ${\cal O}(\varepsilon^{-4}\log\varepsilon^{-1})$, measured by its fundamental operations, is established for the first-order augmented Lagrangian method for finding an $\varepsilon$-KKT solution of the constrained minimax problems.  ( 2 min )
    A Protocol for Intelligible Interaction Between Agents That Learn and Explain. (arXiv:2301.01819v1 [cs.AI])
    Recent engineering developments have seen the emergence of Machine Learning (ML) as a powerful form of data analysis with widespread applicability beyond its historical roots in the design of autonomous agents. However, relatively little attention has been paid to the interaction between people and ML systems. Recent developments on Explainable ML address this by providing visual and textual information on how the ML system arrived at a conclusion. In this paper we view the interaction between humans and ML systems within the broader context of interaction between agents capable of learning and explanation. Within this setting, we argue that it is more helpful to view the interaction as characterised by two-way intelligibility of information rather than once-off explanation of a prediction. We formulate two-way intelligibility as a property of a communication protocol. Development of the protocol is motivated by a set of `Intelligibility Axioms' for decision-support systems that use ML with a human-in-the-loop. The axioms are intended as sufficient criteria to claim that: (a) information provided by a human is intelligible to an ML system; and (b) information provided by an ML system is intelligible to a human. The axioms inform the design of a general synchronous interaction model between agents capable of learning and explanation. We identify conditions of compatibility between agents that result in bounded communication, and define Weak and Strong Two-Way Intelligibility between agents as properties of the communication protocol.  ( 2 min )
    Reinforcement Learning With Sparse-Executing Actions via Sparsity Regularization. (arXiv:2105.08666v3 [cs.LG] UPDATED)
    Reinforcement learning (RL) has made remarkable progress in many decision-making tasks, such as Go, game playing, and robotics control. However, classic RL approaches often presume that all actions can be executed an infinite number of times, which is inconsistent with many decision-making scenarios in which actions have limited budgets or execution opportunities. Imagine an agent playing a gunfighting game with limited ammunition. It only fires when the enemy appears in the correct position, making shooting a sparse-executing action. Such sparse-executing action has not been considered by classic RL algorithms in problem formulation or effective algorithms design. This paper attempts to address sparse-executing action issues by first formalizing the problem as a Sparse Action Markov Decision Process (SA-MDP), in which certain actions in the action space can only be executed for limited amounts of time. Then, we propose a policy optimization algorithm called Action Sparsity REgularization (ASRE) that gives each action a distinct preference. ASRE evaluates action sparsity through constrained action sampling and regularizes policy training based on the evaluated action sparsity, represented by action distribution. Experiments on tasks with known sparse-executing actions, where classical RL algorithms struggle to train policy efficiently, ASRE effectively constrains the action sampling and outperforms baselines. Moreover, we present that ASRE can generally improve the performance in Atari games, demonstrating its broad applicability  ( 2 min )
    Network Utility Maximization with Unknown Utility Functions: A Distributed, Data-Driven Bilevel Optimization Approach. (arXiv:2301.01801v1 [cs.LG])
    Fair resource allocation is one of the most important topics in communication networks. Existing solutions almost exclusively assume each user utility function is known and concave. This paper seeks to answer the following question: how to allocate resources when utility functions are unknown, even to the users? This answer has become increasingly important in the next-generation AI-aware communication networks where the user utilities are complex and their closed-forms are hard to obtain. In this paper, we provide a new solution using a distributed and data-driven bilevel optimization approach, where the lower level is a distributed network utility maximization (NUM) algorithm with concave surrogate utility functions, and the upper level is a data-driven learning algorithm to find the best surrogate utility functions that maximize the sum of true network utility. The proposed algorithm learns from data samples (utility values or gradient values) to autotune the surrogate utility functions to maximize the true network utility, so works for unknown utility functions. For the general network, we establish the nonasymptotic convergence rate of the proposed algorithm with nonconcave utility functions. The simulations validate our theoretical results and demonstrate the great effectiveness of the proposed method in a real-world network.  ( 2 min )
    Comprehensive analysis of gene expression profiles to radiation exposure reveals molecular signatures of low-dose radiation response. (arXiv:2301.01769v1 [q-bio.GN])
    There are various sources of ionizing radiation exposure, where medical exposure for radiation therapy or diagnosis is the most common human-made source. Understanding how gene expression is modulated after ionizing radiation exposure and investigating the presence of any dose-dependent gene expression patterns have broad implications for health risks from radiotherapy, medical radiation diagnostic procedures, as well as other environmental exposure. In this paper, we perform a comprehensive pathway-based analysis of gene expression profiles in response to low-dose radiation exposure, in order to examine the potential mechanism of gene regulation underlying such responses. To accomplish this goal, we employ a statistical framework to determine whether a specific group of genes belonging to a known pathway display coordinated expression patterns that are modulated in a manner consistent with the radiation level. Findings in our study suggest that there exist complex yet consistent signatures that reflect the molecular response to radiation exposure, which differ between low-dose and high-dose radiation.  ( 2 min )
    Chat2Map: Efficient Scene Mapping from Multi-Ego Conversations. (arXiv:2301.02184v1 [cs.CV])
    Can conversational videos captured from multiple egocentric viewpoints reveal the map of a scene in a cost-efficient way? We seek to answer this question by proposing a new problem: efficiently building the map of a previously unseen 3D environment by exploiting shared information in the egocentric audio-visual observations of participants in a natural conversation. Our hypothesis is that as multiple people ("egos") move in a scene and talk among themselves, they receive rich audio-visual cues that can help uncover the unseen areas of the scene. Given the high cost of continuously processing egocentric visual streams, we further explore how to actively coordinate the sampling of visual information, so as to minimize redundancy and reduce power use. To that end, we present an audio-visual deep reinforcement learning approach that works with our shared scene mapper to selectively turn on the camera to efficiently chart out the space. We evaluate the approach using a state-of-the-art audio-visual simulator for 3D scenes as well as real-world video. Our model outperforms previous state-of-the-art mapping methods, and achieves an excellent cost-accuracy tradeoff. Project: this http URL  ( 2 min )
    PA-GM: Position-Aware Learning of Embedding Networks for Deep Graph Matching. (arXiv:2301.01932v1 [cs.CV])
    Graph matching can be formalized as a combinatorial optimization problem, where there are corresponding relationships between pairs of nodes that can be represented as edges. This problem becomes challenging when there are potential ambiguities present due to nodes and edges with high similarity, and there is a need to find accurate results for similar content matching. In this paper, we introduce a novel end-to-end neural network that can map the linear assignment problem into a high-dimensional space augmented with node-level relative position information, which is crucial for improving the method's performance for similar content matching. Our model constructs the anchor set for the relative position of nodes and then aggregates the feature information of the target node and each anchor node based on a measure of relative position. It then learns the node feature representation by integrating the topological structure and the relative position information, thus realizing the linear assignment between the two graphs. To verify the effectiveness and generalizability of our method, we conduct graph matching experiments, including cross-category matching, on different real-world datasets. Comparisons with different baselines demonstrate the superiority of our method. Our source code is available under https://github.com/anonymous.  ( 2 min )
    RePAD: Real-time Proactive Anomaly Detection for Time Series. (arXiv:2001.08922v7 [cs.LG] UPDATED)
    During the past decade, many anomaly detection approaches have been introduced in different fields such as network monitoring, fraud detection, and intrusion detection. However, they require understanding of data pattern and often need a long off-line period to build a model or network for the target data. Providing real-time and proactive anomaly detection for streaming time series without human intervention and domain knowledge is highly valuable since it greatly reduces human effort and enables appropriate countermeasures to be undertaken before a disastrous damage, failure, or other harmful event occurs. However, this issue has not been well studied yet. To address it, this paper proposes RePAD, which is a Real-time Proactive Anomaly Detection algorithm for streaming time series based on Long Short-Term Memory (LSTM). RePAD utilizes short-term historic data points to predict and determine whether or not the upcoming data point is a sign that an anomaly is likely to happen in the near future. By dynamically adjusting the detection threshold over time, RePAD is able to tolerate minor pattern change in time series and detect anomalies either proactively or on time. Experiments based on two time series datasets collected from the Numenta Anomaly Benchmark demonstrate that RePAD is able to proactively detect anomalies and provide early warnings in real time without human intervention and domain knowledge.  ( 2 min )
    fintech-kMC: Agent based simulations of financial platforms for design and testing of machine learning systems. (arXiv:2301.01807v1 [cs.LG])
    We discuss our simulation tool, fintech-kMC, which is designed to generate synthetic data for machine learning model development and testing. fintech-kMC is an agent-based model driven by a kinetic Monte Carlo (a.k.a. continuous time Monte Carlo) engine which simulates the behaviour of customers using an online digital financial platform. The tool provides an interpretable, reproducible, and realistic way of generating synthetic data which can be used to validate and test AI/ML models and pipelines to be used in real-world customer-facing financial applications.  ( 2 min )
    Comparing Ordering Strategies For Process Discovery Using Synthesis Rules. (arXiv:2301.02182v1 [cs.DB])
    Process discovery aims to learn process models from observed behaviors, i.e., event logs, in the information systems.The discovered models serve as the starting point for process mining techniques that are used to address performance and compliance problems. Compared to the state-of-the-art Inductive Miner, the algorithm applying synthesis rules from the free-choice net theory discovers process models with more flexible (non-block) structures while ensuring the same desirable soundness and free-choiceness properties. Moreover, recent development in this line of work shows that the discovered models have compatible quality. Following the synthesis rules, the algorithm incrementally modifies an existing process model by adding the activities in the event log one at a time. As the applications of rules are highly dependent on the existing model structure, the model quality and computation time are significantly influenced by the order of adding activities. In this paper, we investigate the effect of different ordering strategies on the discovered models (w.r.t. fitness and precision) and the computation time using real-life event data. The results show that the proposed ordering strategy can improve the quality of the resulting process models while requiring less time compared to the ordering strategy solely based on the frequency of activities.  ( 2 min )
    Playing hide and seek: tackling in-store picking operations while improving customer experience. (arXiv:2301.02142v1 [cs.LG])
    The evolution of the retail business presents new challenges and raises pivotal questions on how to reinvent stores and supply chains to meet the growing demand of the online channel. One of the recent measures adopted by omnichannel retailers is to address the growth of online sales using in-store picking, which allows serving online orders using existing assets. However, it comes with the downside of harming the offline customer experience. To achieve picking policies adapted to the dynamic customer flows of a retail store, we formalize a new problem called Dynamic In-store Picker Routing Problem (diPRP). In this relevant problem - diPRP - a picker tries to pick online orders while minimizing customer encounters. We model the problem as a Markov Decision Process (MDP) and solve it using a hybrid solution approach comprising mathematical programming and reinforcement learning components. Computational experiments on synthetic instances suggest that the algorithm converges to efficient policies. Furthermore, we apply our approach in the context of a large European retailer to assess the results of the proposed policies regarding the number of orders picked and customers encountered. Our work suggests that retailers should be able to scale the in-store picking of online orders without jeopardizing the experience of offline customers. The policies learned using the proposed solution approach reduced the number of customer encounters by more than 50% when compared to policies solely focused on picking orders. Thus, to pursue omnichannel strategies that adequately trade-off operational efficiency and customer experience, retailers cannot rely on actual simplistic picking strategies, such as choosing the shortest possible route.  ( 2 min )
    Differentially Private Federated Learning on Heterogeneous Data. (arXiv:2111.09278v3 [cs.LG] UPDATED)
    Federated Learning (FL) is a paradigm for large-scale distributed learning which faces two key challenges: (i) efficient training from highly heterogeneous user data, and (ii) protecting the privacy of participating users. In this work, we propose a novel FL approach (DP-SCAFFOLD) to tackle these two challenges together by incorporating Differential Privacy (DP) constraints into the popular SCAFFOLD algorithm. We focus on the challenging setting where users communicate with a "honest-but-curious" server without any trusted intermediary, which requires to ensure privacy not only towards a third-party with access to the final model but also towards the server who observes all user communications. Using advanced results from DP theory, we establish the convergence of our algorithm for convex and non-convex objectives. Our analysis clearly highlights the privacy-utility trade-off under data heterogeneity, and demonstrates the superiority of DP-SCAFFOLD over the state-of-the-art algorithm DP-FedAvg when the number of local updates and the level of heterogeneity grow. Our numerical results confirm our analysis and show that DP-SCAFFOLD provides significant gains in practice.  ( 2 min )
    Unsupervised Manifold Linearizing and Clustering. (arXiv:2301.01805v1 [cs.LG])
    Clustering data lying close to a union of low-dimensional manifolds, with each manifold as a cluster, is a fundamental problem in machine learning. When the manifolds are assumed to be linear subspaces, many methods succeed using low-rank and sparse priors, which have been studied extensively over the past two decades. Unfortunately, most real-world datasets can not be well approximated by linear subspaces. On the other hand, several works have proposed to identify the manifolds by learning a feature map such that the data transformed by the map lie in a union of linear subspaces, even though the original data are from non-linear manifolds. However, most works either assume knowledge of the membership of samples to clusters, or are shown to learn trivial representations. In this paper, we propose to simultaneously perform clustering and learn a union-of-subspace representation via Maximal Coding Rate Reduction. Experiments on synthetic and realistic datasets show that the proposed method achieves clustering accuracy comparable with state-of-the-art alternatives, while being more scalable and learning geometrically meaningful representations.  ( 2 min )
    FICE: Text-Conditioned Fashion Image Editing With Guided GAN Inversion. (arXiv:2301.02110v1 [cs.CV])
    Fashion-image editing represents a challenging computer vision task, where the goal is to incorporate selected apparel into a given input image. Most existing techniques, known as Virtual Try-On methods, deal with this task by first selecting an example image of the desired apparel and then transferring the clothing onto the target person. Conversely, in this paper, we consider editing fashion images with text descriptions. Such an approach has several advantages over example-based virtual try-on techniques, e.g.: (i) it does not require an image of the target fashion item, and (ii) it allows the expression of a wide variety of visual concepts through the use of natural language. Existing image-editing methods that work with language inputs are heavily constrained by their requirement for training sets with rich attribute annotations or they are only able to handle simple text descriptions. We address these constraints by proposing a novel text-conditioned editing model, called FICE (Fashion Image CLIP Editing), capable of handling a wide variety of diverse text descriptions to guide the editing procedure. Specifically with FICE, we augment the common GAN inversion process by including semantic, pose-related, and image-level constraints when generating images. We leverage the capabilities of the CLIP model to enforce the semantics, due to its impressive image-text association capabilities. We furthermore propose a latent-code regularization technique that provides the means to better control the fidelity of the synthesized images. We validate FICE through rigorous experiments on a combination of VITON images and Fashion-Gen text descriptions and in comparison with several state-of-the-art text-conditioned image editing approaches. Experimental results demonstrate FICE generates highly realistic fashion images and leads to stronger editing performance than existing competing approaches.  ( 2 min )
    NODAGS-Flow: Nonlinear Cyclic Causal Structure Learning. (arXiv:2301.01849v1 [cs.LG])
    Learning causal relationships between variables is a well-studied problem in statistics, with many important applications in science. However, modeling real-world systems remain challenging, as most existing algorithms assume that the underlying causal graph is acyclic. While this is a convenient framework for developing theoretical developments about causal reasoning and inference, the underlying modeling assumption is likely to be violated in real systems, because feedback loops are common (e.g., in biological systems). Although a few methods search for cyclic causal models, they usually rely on some form of linearity, which is also limiting, or lack a clear underlying probabilistic model. In this work, we propose a novel framework for learning nonlinear cyclic causal graphical models from interventional data, called NODAGS-Flow. We perform inference via direct likelihood optimization, employing techniques from residual normalizing flows for likelihood estimation. Through synthetic experiments and an application to single-cell high-content perturbation screening data, we show significant performance improvements with our approach compared to state-of-the-art methods with respect to structure recovery and predictive performance.  ( 2 min )
    Structured Sparsity Inducing Adaptive Optimizers for Deep Learning. (arXiv:2102.03869v2 [cs.LG] UPDATED)
    The parameters of a neural network are naturally organized in groups, some of which might not contribute to its overall performance. To prune out unimportant groups of parameters, we can include some non-differentiable penalty to the objective function, and minimize it using proximal gradient methods. In this paper, we derive the weighted proximal operator, which is a necessary component of these proximal methods, of two structured sparsity inducing penalties. Moreover, they can be approximated efficiently with a numerical solver, and despite this approximation, we prove that existing convergence guarantees are preserved when these operators are integrated as part of a generic adaptive proximal method. Finally, we show that this adaptive method, together with the weighted proximal operators derived here, is indeed capable of finding solutions with structure in their sparsity patterns, on representative examples from computer vision and natural language processing.  ( 2 min )
    On the Convergence Properties of Optimal AdaBoost. (arXiv:1212.1108v3 [cs.LG] UPDATED)
    AdaBoost is one of the most popular ML algorithms. It is simple to implement and often found very effective by practitioners, while still being mathematically elegant and theoretically sound. AdaBoost's interesting behavior in practice still puzzles the ML community. We address the algorithm's stability and establish multiple convergence properties of "Optimal AdaBoost," a term coined by Rudin, Daubechies, and Schapire in 2004. We prove, in a reasonably strong computational sense, the almost universal existence of time averages, and with that, the convergence of the classifier itself, its generalization error, and its resulting margins, among many other objects, for fixed data sets under arguably reasonable conditions. Specifically, we frame Optimal AdaBoost as a dynamical system and, employing tools from ergodic theory, prove that, under a condition that Optimal AdaBoost does not have ties for best weak classifier eventually, a condition for which we provide empirical evidence from high dimensional real-world datasets, the algorithm's update behaves like a continuous map. We provide constructive proofs of several arbitrarily accurate approximations of Optimal AdaBoost; prove that they exhibit certain cycling behavior in finite time, and that the resulting dynamical system is ergodic; and establish sufficient conditions for the same to hold for the actual Optimal-AdaBoost update. We believe that our results provide reasonably strong evidence for the affirmative answer to two open conjectures, at least from a broad computational-theory perspective: AdaBoost always cycles and is an ergodic dynamical system. We present empirical evidence that cycles are hard to detect while time averages stabilize quickly. Our results ground future convergence-rate analysis and may help optimize generalization ability and alleviate a practitioner's burden of deciding how long to run the algorithm.  ( 3 min )
    Exploring Machine Learning Techniques to Identify Important Factors Leading to Injury in Curve Related Crashes. (arXiv:2301.01771v1 [cs.LG])
    Different factors have effects on traffic crashes and crash-related injuries. These factors include segment characteristics, crash-level characteristics, occupant level characteristics, environment characteristics, and vehicle level characteristics. There are several studies regarding these factors' effects on crash injuries. However, limited studies have examined the effects of pre-crash events on injuries, especially for curve-related crashes. The majority of previous studies for curve-related crashes focused on the impact of geometric features or street design factors. The current study tries to eliminate the aforementioned shortcomings by considering important pre-crash events related factors as selected variables and the number of vehicles with or without injury as the predicted variable. This research used CRSS data from the National Highway Traffic Safety Administration (NHTSA), which includes traffic crash-related data for different states in the USA. The relationships are explored using different machine learning algorithms like the random forest, C5.0, CHAID, Bayesian Network, Neural Network, C\&R Tree, Quest, etc. The random forest and SHAP values are used to identify the most effective variables. The C5.0 algorithm, which has the highest accuracy rate among the other algorithms, is used to develop the final model. Analysis results revealed that the extent of the damage, critical pre-crash event, pre-impact location, the trafficway description, roadway surface condition, the month of the crash, the first harmful event, number of motor vehicles, attempted avoidance maneuver, and roadway grade affect the number of vehicles with or without injury in curve-related crashes.  ( 2 min )
    Zen: LSTM-based generation of individual spatiotemporal cellular traffic with interactions. (arXiv:2301.02059v1 [cs.NI])
    Domain-wide recognized by their high value in human presence and activity studies, cellular network datasets (i.e., Charging Data Records, named CdRs), however, present accessibility, usability, and privacy issues, restricting their exploitation and research reproducibility.This paper tackles such challenges by modeling Cdrs that fulfill real-world data attributes. Our designed framework, named Zen follows a four-fold methodology related to (i) the LTSM-based modeling of users' traffic behavior, (ii) the realistic and flexible emulation of spatiotemporal mobility behavior, (iii) the structure of lifelike cellular network infrastructure and social interactions, and (iv) the combination of the three previous modules into realistic Cdrs traces with an individual basis, realistically. Results show that Zen's first and third models accurately capture individual and global distributions of a fully anonymized real-world Cdrs dataset, while the second model is consistent with the literature's revealed features in human mobility. Finally, we validate Zen Cdrs ability of reproducing daily cellular behaviors of the urban population and its usefulness in practical networking applications such as dynamic population tracing, Radio Access Network's power savings, and anomaly detection as compared to real-world CdRs.  ( 2 min )
    SPRING: Situated Conversation Agent Pretrained with Multimodal Questions from Incremental Layout Graph. (arXiv:2301.01949v1 [cs.CL])
    Existing multimodal conversation agents have shown impressive abilities to locate absolute positions or retrieve attributes in simple scenarios, but they fail to perform well when complex relative positions and information alignments are involved, which poses a bottleneck in response quality. In this paper, we propose a Situated Conversation Agent Petrained with Multimodal Questions from INcremental Layout Graph (SPRING) with abilities of reasoning multi-hops spatial relations and connecting them with visual attributes in crowded situated scenarios. Specifically, we design two types of Multimodal Question Answering (MQA) tasks to pretrain the agent. All QA pairs utilized during pretraining are generated from novel Incremental Layout Graphs (ILG). QA pair difficulty labels automatically annotated by ILG are used to promote MQA-based Curriculum Learning. Experimental results verify the SPRING's effectiveness, showing that it significantly outperforms state-of-the-art approaches on both SIMMC 1.0 and SIMMC 2.0 datasets.  ( 2 min )
    Availability Adversarial Attack and Countermeasures for Deep Learning-based Load Forecasting. (arXiv:2301.01832v1 [cs.LG])
    The forecast of electrical loads is essential for the planning and operation of the power system. Recently, advances in deep learning have enabled more accurate forecasts. However, deep neural networks are prone to adversarial attacks. Although most of the literature focuses on integrity-based attacks, this paper proposes availability-based adversarial attacks, which can be more easily implemented by attackers. For each forecast instance, the availability attack position is optimally solved by mixed-integer reformulation of the artificial neural network. To tackle this attack, an adversarial training algorithm is proposed. In simulation, a realistic load forecasting dataset is considered and the attack performance is compared to the integrity-based attack. Meanwhile, the adversarial training algorithm is shown to significantly improve robustness against availability attacks. All codes are available at https://github.com/xuwkk/AAA_Load_Forecast.  ( 2 min )
    Evaluation of Induced Expert Knowledge in Causal Structure Learning by NOTEARS. (arXiv:2301.01817v1 [cs.LG])
    Causal modeling provides us with powerful counterfactual reasoning and interventional mechanism to generate predictions and reason under various what-if scenarios. However, causal discovery using observation data remains a nontrivial task due to unobserved confounding factors, finite sampling, and changes in the data distribution. These can lead to spurious cause-effect relationships. To mitigate these challenges in practice, researchers augment causal learning with known causal relations. The goal of the paper is to study the impact of expert knowledge on causal relations in the form of additional constraints used in the formulation of the nonparametric NOTEARS. We provide a comprehensive set of comparative analyses of biasing the model using different types of knowledge. We found that (i) knowledge that corrects the mistakes of the NOTEARS model can lead to statistically significant improvements, (ii) constraints on active edges have a larger positive impact on causal discovery than inactive edges, and surprisingly, (iii) the induced knowledge does not correct on average more incorrect active and/or inactive edges than expected. We also demonstrate the behavior of the model and the effectiveness of domain knowledge on a real-world dataset.  ( 2 min )
    Physics-informed self-supervised deep learning reconstruction for accelerated first-pass perfusion cardiac MRI. (arXiv:2301.02033v1 [eess.IV])
    First-pass perfusion cardiac magnetic resonance (FPP-CMR) is becoming an essential non-invasive imaging method for detecting deficits of myocardial blood flow, allowing the assessment of coronary heart disease. Nevertheless, acquisitions suffer from relatively low spatial resolution and limited heart coverage. Compressed sensing (CS) methods have been proposed to accelerate FPP-CMR and achieve higher spatial resolution. However, the long reconstruction times have limited the widespread clinical use of CS in FPP-CMR. Deep learning techniques based on supervised learning have emerged as alternatives for speeding up reconstructions. However, these approaches require fully sampled data for training, which is not possible to obtain, particularly high-resolution FPP-CMR images. Here, we propose a physics-informed self-supervised deep learning FPP-CMR reconstruction approach for accelerating FPP-CMR scans and hence facilitate high spatial resolution imaging. The proposed method provides high-quality FPP-CMR images from 10x undersampled data without using fully sampled reference data.  ( 2 min )
    Max-Min Diversification with Fairness Constraints: Exact and Approximation Algorithms. (arXiv:2301.02053v1 [cs.DS])
    Diversity maximization aims to select a diverse and representative subset of items from a large dataset. It is a fundamental optimization task that finds applications in data summarization, feature selection, web search, recommender systems, and elsewhere. However, in a setting where data items are associated with different groups according to sensitive attributes like sex or race, it is possible that algorithmic solutions for this task, if left unchecked, will under- or over-represent some of the groups. Therefore, we are motivated to address the problem of \emph{max-min diversification with fairness constraints}, aiming to select $k$ items to maximize the minimum distance between any pair of selected items while ensuring that the number of items selected from each group falls within predefined lower and upper bounds. In this work, we propose an exact algorithm based on integer linear programming that is suitable for small datasets as well as a $\frac{1-\varepsilon}{5}$-approximation algorithm for any $\varepsilon \in (0, 1)$ that scales to large datasets. Extensive experiments on real-world datasets demonstrate the superior performance of our proposed algorithms over existing ones.  ( 2 min )
    Randomized Message-Interception Smoothing: Gray-box Certificates for Graph Neural Networks. (arXiv:2301.02039v1 [cs.LG])
    Randomized smoothing is one of the most promising frameworks for certifying the adversarial robustness of machine learning models, including Graph Neural Networks (GNNs). Yet, existing randomized smoothing certificates for GNNs are overly pessimistic since they treat the model as a black box, ignoring the underlying architecture. To remedy this, we propose novel gray-box certificates that exploit the message-passing principle of GNNs: We randomly intercept messages and carefully analyze the probability that messages from adversarially controlled nodes reach their target nodes. Compared to existing certificates, we certify robustness to much stronger adversaries that control entire nodes in the graph and can arbitrarily manipulate node features. Our certificates provide stronger guarantees for attacks at larger distances, as messages from farther-away nodes are more likely to get intercepted. We demonstrate the effectiveness of our method on various models and datasets. Since our gray-box certificates consider the underlying graph structure, we can significantly improve certifiable robustness by applying graph sparsification.  ( 2 min )
  • Open

    Plasticity Neural Network Based on Astrocytic effects at Critical Period, Synaptic Competition and Strength Rebalance by Current and Mnemonic Brain Plasticity and Synapse Formation. (arXiv:2203.11740v8 [cs.NE] UPDATED)
    In addition to the shared weights of synaptic connections, PNN includes weights of synaptic ranges for Forward propagation and Back propagation[15,16,19-24]. PNN considers synaptic strength balance in dynamic of phagocytosing of synapses and static of constant sum of synapses length [15],the lead behavior of the school of fish is well embodied in our PNN. Synapse formation will inhibit dendrites generation to a certain extent in experiments, by simulations synapse formation will inhibit the function of dendrites [16]. Closing the critical period will cause neurological disorder in experiments, but worse results in PNN simulations [19]. The memory persistence gradient information of backward circuit similar to the Enforcing Resilience in a Spring Boot. The relatively good and inferior gradient information in synapse formation of backward circuit like the folds of the brain. Considering both negative and positive memories persistence help activate synapse length changes with iterations better than only considering positive memory. So using memory of fear learning and improving of synaptic activity to observe obviously [20]. Memory persistence factor also inhibit local synaptic accumulation. And refers PNN can also introduce the relatively good and inferior solution to update the velocity of particle in PSO. Astrocytic phagocytosis will avoid the local accumulation of synapses by simulation (Lack of astrocytic phagocytosis causes excitatory synapses and functionally impaired synapses accumulate in experiments and lead to destruction of cognition, but local longer synapses and worse results in PNN simulations) [21]. It gives relationship of human intelligence and cortical thickness, individual differences in brain[22].PNN also considered the memory engram cells that strengthened Synaptic strength[23]. The simple PNN which only has the synaptic phagocytosis.  ( 3 min )
    How Does Sharpness-Aware Minimization Minimize Sharpness?. (arXiv:2211.05729v2 [cs.LG] UPDATED)
    Sharpness-Aware Minimization (SAM) is a highly effective regularization technique for improving the generalization of deep neural networks for various settings. However, the underlying working of SAM remains elusive because of various intriguing approximations in the theoretical characterizations. SAM intends to penalize a notion of sharpness of the model but implements a computationally efficient variant; moreover, a third notion of sharpness was used for proving generalization guarantees. The subtle differences in these notions of sharpness can indeed lead to significantly different empirical results. This paper rigorously nails down the exact sharpness notion that SAM regularizes and clarifies the underlying mechanism. We also show that the two steps of approximations in the original motivation of SAM individually lead to inaccurate local conclusions, but their combination accidentally reveals the correct effect, when full-batch gradients are applied. Furthermore, we also prove that the stochastic version of SAM in fact regularizes the third notion of sharpness mentioned above, which is most likely to be the preferred notion for practical performance. The key mechanism behind this intriguing phenomenon is the alignment between the gradient and the top eigenvector of Hessian when SAM is applied.  ( 2 min )
    FREDE: Anytime Graph Embeddings. (arXiv:2006.04746v2 [cs.LG] UPDATED)
    Low-dimensional representations, or embeddings, of a graph's nodes facilitate several practical data science and data engineering tasks. As such embeddings rely, explicitly or implicitly, on a similarity measure among nodes, they require the computation of a quadratic similarity matrix, inducing a tradeoff between space complexity and embedding quality. To date, no graph embedding work combines (i) linear space complexity, (ii) a nonlinear transform as its basis, and (iii) nontrivial quality guarantees. In this paper we introduce FREDE (FREquent Directions Embedding), a graph embedding based on matrix sketching that combines those three desiderata. Starting out from the observation that embedding methods aim to preserve the covariance among the rows of a similarity matrix}, FREDE iteratively improves on quality while individually processing rows of a nonlinearly transformed PPR similarity matrix derived from a state-of-the-art graph embedding method} and provides, at any iteration, column-covariance approximation guarantees in due course almost indistinguishable from those of the optimal approximation by SVD. Our experimental evaluation on variably sized networks shows that FREDE performs almost as well as SVD and competitively against state-of-the-art embedding methods in diverse data science tasks, even when it is based on as little as 10% of node similarities.  ( 2 min )
    Deep Reinforcement Learning in a Monetary Model. (arXiv:2104.09368v2 [econ.EM] UPDATED)
    We propose using deep reinforcement learning to solve dynamic stochastic general equilibrium models. Agents are represented by deep artificial neural networks and learn to solve their dynamic optimisation problem by interacting with the model environment, of which they have no a priori knowledge. Deep reinforcement learning offers a flexible yet principled way to model bounded rationality within this general class of models. We apply our proposed approach to a classical model from the adaptive learning literature in macroeconomics which looks at the interaction of monetary and fiscal policy. We find that, contrary to adaptive learning, the artificially intelligent household can solve the model in all policy regimes.  ( 2 min )
    NODAGS-Flow: Nonlinear Cyclic Causal Structure Learning. (arXiv:2301.01849v1 [cs.LG])
    Learning causal relationships between variables is a well-studied problem in statistics, with many important applications in science. However, modeling real-world systems remain challenging, as most existing algorithms assume that the underlying causal graph is acyclic. While this is a convenient framework for developing theoretical developments about causal reasoning and inference, the underlying modeling assumption is likely to be violated in real systems, because feedback loops are common (e.g., in biological systems). Although a few methods search for cyclic causal models, they usually rely on some form of linearity, which is also limiting, or lack a clear underlying probabilistic model. In this work, we propose a novel framework for learning nonlinear cyclic causal graphical models from interventional data, called NODAGS-Flow. We perform inference via direct likelihood optimization, employing techniques from residual normalizing flows for likelihood estimation. Through synthetic experiments and an application to single-cell high-content perturbation screening data, we show significant performance improvements with our approach compared to state-of-the-art methods with respect to structure recovery and predictive performance.  ( 2 min )
    On the Convergence Properties of Optimal AdaBoost. (arXiv:1212.1108v3 [cs.LG] UPDATED)
    AdaBoost is one of the most popular ML algorithms. It is simple to implement and often found very effective by practitioners, while still being mathematically elegant and theoretically sound. AdaBoost's interesting behavior in practice still puzzles the ML community. We address the algorithm's stability and establish multiple convergence properties of "Optimal AdaBoost," a term coined by Rudin, Daubechies, and Schapire in 2004. We prove, in a reasonably strong computational sense, the almost universal existence of time averages, and with that, the convergence of the classifier itself, its generalization error, and its resulting margins, among many other objects, for fixed data sets under arguably reasonable conditions. Specifically, we frame Optimal AdaBoost as a dynamical system and, employing tools from ergodic theory, prove that, under a condition that Optimal AdaBoost does not have ties for best weak classifier eventually, a condition for which we provide empirical evidence from high dimensional real-world datasets, the algorithm's update behaves like a continuous map. We provide constructive proofs of several arbitrarily accurate approximations of Optimal AdaBoost; prove that they exhibit certain cycling behavior in finite time, and that the resulting dynamical system is ergodic; and establish sufficient conditions for the same to hold for the actual Optimal-AdaBoost update. We believe that our results provide reasonably strong evidence for the affirmative answer to two open conjectures, at least from a broad computational-theory perspective: AdaBoost always cycles and is an ergodic dynamical system. We present empirical evidence that cycles are hard to detect while time averages stabilize quickly. Our results ground future convergence-rate analysis and may help optimize generalization ability and alleviate a practitioner's burden of deciding how long to run the algorithm.  ( 3 min )
    Solving Multistage Stochastic Linear Programming via Regularized Linear Decision Rules: An Application to Hydrothermal Dispatch Planning. (arXiv:2110.03146v3 [math.OC] UPDATED)
    The solution of multistage stochastic linear problems (MSLP) represents a challenge for many application areas. Long-term hydrothermal dispatch planning (LHDP) materializes this challenge in a real-world problem that affects electricity markets, economies, and natural resources worldwide. No closed-form solutions are available for MSLP and the definition of non-anticipative policies with high-quality out-of-sample performance is crucial. Linear decision rules (LDR) provide an interesting simulation-based framework for finding high-quality policies for MSLP through two-stage stochastic models. In practical applications, however, the number of parameters to be estimated when using an LDR may be close to or higher than the number of scenarios of the sample average approximation problem, thereby generating an in-sample overfit and poor performances in out-of-sample simulations. In this paper, we propose a novel regularized LDR to solve MSLP based on the AdaLASSO (adaptive least absolute shrinkage and selection operator). The goal is to use the parsimony principle, as largely studied in high-dimensional linear regression models, to obtain better out-of-sample performance for LDR applied to MSLP. Computational experiments show that the overfit threat is non-negligible when using classical non-regularized LDR to solve the LHDP, one of the most studied MSLP with relevant applications. Our analysis highlights the following benefits of the proposed framework in comparison to the non-regularized benchmark: 1) significant reductions in the number of non-zero coefficients (model parsimony), 2) substantial cost reductions in out-of-sample evaluations, and 3) improved spot-price profiles.  ( 3 min )
    Robust $Q$-learning Algorithm for Markov Decision Processes under Wasserstein Uncertainty. (arXiv:2210.00898v2 [cs.LG] UPDATED)
    We present a novel $Q$-learning algorithm to solve distributionally robust Markov decision problems, where the corresponding ambiguity set of transition probabilities for the underlying Markov decision process is a Wasserstein ball around a (possibly estimated) reference measure. We prove convergence of the presented algorithm and provide several examples also using real data to illustrate both the tractability of our algorithm as well as the benefits of considering distributional robustness when solving stochastic optimal control problems, in particular when the estimated distributions turn out to be misspecified in practice.  ( 2 min )
    Sym-NCO: Leveraging Symmetricity for Neural Combinatorial Optimization. (arXiv:2205.13209v2 [cs.LG] UPDATED)
    Deep reinforcement learning (DRL)-based combinatorial optimization (CO) methods (i.e., DRL-NCO) have shown significant merit over the conventional CO solvers as DRL-NCO is capable of learning CO solvers less relying on problem-specific expert domain knowledge (heuristic method) and supervised labeled data (supervised learning method). This paper presents a novel training scheme, Sym-NCO, which is a regularizer-based training scheme that leverages universal symmetricities in various CO problems and solutions. Leveraging symmetricities such as rotational and reflectional invariance can greatly improve the generalization capability of DRL-NCO because it allows the learned solver to exploit the commonly shared symmetricities in the same CO problem class. Our experimental results verify that our Sym-NCO greatly improves the performance of DRL-NCO methods in four CO tasks, including the traveling salesman problem (TSP), capacitated vehicle routing problem (CVRP), prize collecting TSP (PCTSP), and orienteering problem (OP), without utilizing problem-specific expert domain knowledge. Remarkably, Sym-NCO outperformed not only the existing DRL-NCO methods but also a competitive conventional solver, the iterative local search (ILS), in PCTSP at 240 faster speed. Our source code is available at https://github.com/alstn12088/Sym-NCO.  ( 2 min )
    Value Enhancement of Reinforcement Learning via Efficient and Robust Trust Region Optimization. (arXiv:2301.02220v1 [stat.ML])
    Reinforcement learning (RL) is a powerful machine learning technique that enables an intelligent agent to learn an optimal policy that maximizes the cumulative rewards in sequential decision making. Most of methods in the existing literature are developed in \textit{online} settings where the data are easy to collect or simulate. Motivated by high stake domains such as mobile health studies with limited and pre-collected data, in this paper, we study \textit{offline} reinforcement learning methods. To efficiently use these datasets for policy optimization, we propose a novel value enhancement method to improve the performance of a given initial policy computed by existing state-of-the-art RL algorithms. Specifically, when the initial policy is not consistent, our method will output a policy whose value is no worse and often better than that of the initial policy. When the initial policy is consistent, under some mild conditions, our method will yield a policy whose value converges to the optimal one at a faster rate than the initial policy, achieving the desired ``value enhancement" property. The proposed method is generally applicable to any parametrized policy that belongs to certain pre-specified function class (e.g., deep neural networks). Extensive numerical studies are conducted to demonstrate the superior performance of our method.  ( 2 min )
    Dynamic Bayesian Learning and Calibration of Spatiotemporal Mechanistic Systems. (arXiv:2208.06528v3 [stat.ME] UPDATED)
    We develop an approach for fully Bayesian learning and calibration of spatiotemporal dynamical mechanistic models based on noisy observations. Calibration is achieved by melding information from observed data with simulated computer experiments from the mechanistic system. The joint melding makes use of both Gaussian and non-Gaussian state-space methods as well as Gaussian process regression. Assuming the dynamical system is controlled by a finite collection of inputs, Gaussian process regression learns the effect of these parameters through a number of training runs, driving the stochastic innovations of the spatiotemporal state-space component. This enables efficient modeling of the dynamics over space and time. Through reduced-rank Gaussian processes and a conjugate model specification, our methodology is applicable to large-scale calibration and inverse problems. Our method is general, extensible, and capable of learning a wide range of dynamical systems with potential model misspecification. We demonstrate this flexibility through solving inverse problems arising in the analysis of ordinary and partial nonlinear differential equations and, in addition, to a black-box computer model generating spatiotemporal dynamics across a network.  ( 2 min )
    Time-inhomogeneous diffusion geometry and topology. (arXiv:2203.14860v2 [cs.LG] UPDATED)
    Diffusion condensation is a dynamic process that yields a sequence of multiscale data representations that aim to encode meaningful abstractions. It has proven effective for manifold learning, denoising, clustering, and visualization of high-dimensional data. Diffusion condensation is constructed as a time-inhomogeneous process where each step first computes and then applies a diffusion operator to the data. We theoretically analyze the convergence and evolution of this process from geometric, spectral, and topological perspectives. From a geometric perspective, we obtain convergence bounds based on the smallest transition probability and the radius of the data, whereas from a spectral perspective, our bounds are based on the eigenspectrum of the diffusion kernel. Our spectral results are of particular interest since most of the literature on data diffusion is focused on homogeneous processes. From a topological perspective, we show diffusion condensation generalizes centroid-based hierarchical clustering. We use this perspective to obtain a bound based on the number of data points, independent of their location. To understand the evolution of the data geometry beyond convergence, we use topological data analysis. We show that the condensation process itself defines an intrinsic condensation homology. We use this intrinsic topology as well as the ambient persistent homology of the condensation process to study how the data changes over diffusion time. We demonstrate both types of topological information in well-understood toy examples. Our work gives theoretical insights into the convergence of diffusion condensation, and shows that it provides a link between topological and geometric data analysis.  ( 2 min )
    A first-order augmented Lagrangian method for constrained minimax optimization. (arXiv:2301.02060v1 [math.OC])
    In this paper we study a class of constrained minimax problems. In particular, we propose a first-order augmented Lagrangian method for solving them, whose subproblems turn out to be a much simpler structured minimax problem and are suitably solved by a first-order method recently developed in [26] by the authors. Under some suitable assumptions, an \emph{operation complexity} of ${\cal O}(\varepsilon^{-4}\log\varepsilon^{-1})$, measured by its fundamental operations, is established for the first-order augmented Lagrangian method for finding an $\varepsilon$-KKT solution of the constrained minimax problems.  ( 2 min )
    Random forests, sound symbolism and Pokemon evolution. (arXiv:2301.01948v1 [cs.LG])
    This study constructs machine learning algorithms that are trained to classify samples using sound symbolism, and then it reports on an experiment designed to measure their understanding against human participants. Random forests are trained using the names of Pokemon, which are fictional video game characters, and their evolutionary status. Pokemon undergo evolution when certain in-game conditions are met. Evolution changes the appearance, abilities, and names of Pokemon. In the first experiment, we train three random forests using the sounds that make up the names of Japanese, Chinese, and Korean Pokemon to classify Pokemon into pre-evolution and post-evolution categories. We then train a fourth random forest using the results of an elicitation experiment whereby Japanese participants named previously unseen Pokemon. In Experiment 2, we reproduce those random forests with name length as a feature and compare the performance of the random forests against humans in a classification experiment whereby Japanese participants classified the names elicited in Experiment 1 into pre-and post-evolution categories. Experiment 2 reveals an issue pertaining to overfitting in Experiment 1 which we resolve using a novel cross-validation method. The results show that the random forests are efficient learners of systematic sound-meaning correspondence patterns and can classify samples with greater accuracy than the human participants.  ( 2 min )
    RePAD: Real-time Proactive Anomaly Detection for Time Series. (arXiv:2001.08922v7 [cs.LG] UPDATED)
    During the past decade, many anomaly detection approaches have been introduced in different fields such as network monitoring, fraud detection, and intrusion detection. However, they require understanding of data pattern and often need a long off-line period to build a model or network for the target data. Providing real-time and proactive anomaly detection for streaming time series without human intervention and domain knowledge is highly valuable since it greatly reduces human effort and enables appropriate countermeasures to be undertaken before a disastrous damage, failure, or other harmful event occurs. However, this issue has not been well studied yet. To address it, this paper proposes RePAD, which is a Real-time Proactive Anomaly Detection algorithm for streaming time series based on Long Short-Term Memory (LSTM). RePAD utilizes short-term historic data points to predict and determine whether or not the upcoming data point is a sign that an anomaly is likely to happen in the near future. By dynamically adjusting the detection threshold over time, RePAD is able to tolerate minor pattern change in time series and detect anomalies either proactively or on time. Experiments based on two time series datasets collected from the Numenta Anomaly Benchmark demonstrate that RePAD is able to proactively detect anomalies and provide early warnings in real time without human intervention and domain knowledge.  ( 2 min )
    Enhancement attacks in biomedical machine learning. (arXiv:2301.01885v1 [stat.ML])
    The prevalence of machine learning in biomedical research is rapidly growing, yet the trustworthiness of such research is often overlooked. While some previous works have investigated the ability of adversarial attacks to degrade model performance in medical imaging, the ability to falsely improve performance via recently-developed "enhancement attacks" may be a greater threat to biomedical machine learning. In the spirit of developing attacks to better understand trustworthiness, we developed three techniques to drastically enhance prediction performance of classifiers with minimal changes to features, including the enhancement of 1) within-dataset predictions, 2) a particular method over another, and 3) cross-dataset generalization. Our within-dataset enhancement framework falsely improved classifiers' accuracy from 50% to almost 100% while maintaining high feature similarities between original and enhanced data (Pearson's r's>0.99). Similarly, the method-specific enhancement framework was effective in falsely improving the performance of one method over another. For example, a simple neural network outperformed LR by 50% on our enhanced dataset, although no performance differences were present in the original dataset. Crucially, the original and enhanced data were still similar (r=0.95). Finally, we demonstrated that enhancement is not specific to within-dataset predictions but can also be adapted to enhance the generalization accuracy of one dataset to another by up to 38%. Overall, our results suggest that more robust data sharing and provenance tracking pipelines are necessary to maintain data integrity in biomedical machine learning research.  ( 2 min )
    Network Utility Maximization with Unknown Utility Functions: A Distributed, Data-Driven Bilevel Optimization Approach. (arXiv:2301.01801v1 [cs.LG])
    Fair resource allocation is one of the most important topics in communication networks. Existing solutions almost exclusively assume each user utility function is known and concave. This paper seeks to answer the following question: how to allocate resources when utility functions are unknown, even to the users? This answer has become increasingly important in the next-generation AI-aware communication networks where the user utilities are complex and their closed-forms are hard to obtain. In this paper, we provide a new solution using a distributed and data-driven bilevel optimization approach, where the lower level is a distributed network utility maximization (NUM) algorithm with concave surrogate utility functions, and the upper level is a data-driven learning algorithm to find the best surrogate utility functions that maximize the sum of true network utility. The proposed algorithm learns from data samples (utility values or gradient values) to autotune the surrogate utility functions to maximize the true network utility, so works for unknown utility functions. For the general network, we establish the nonasymptotic convergence rate of the proposed algorithm with nonconcave utility functions. The simulations validate our theoretical results and demonstrate the great effectiveness of the proposed method in a real-world network.  ( 2 min )
    A general framework for implementing distances for categorical variables. (arXiv:2301.02190v1 [stat.ML])
    The degree to which subjects differ from each other with respect to certain properties measured by a set of variables, plays an important role in many statistical methods. For example, classification, clustering, and data visualization methods all require a quantification of differences in the observed values. We can refer to the quantification of such differences, as distance. An appropriate definition of a distance depends on the nature of the data and the problem at hand. For distances between numerical variables, there exist many definitions that depend on the size of the observed differences. For categorical data, the definition of a distance is more complex, as there is no straightforward quantification of the size of the observed differences. Consequently, many proposals exist that can be used to measure differences based on categorical variables. In this paper, we introduce a general framework that allows for an efficient and transparent implementation of distances between observations on categorical variables. We show that several existing distances can be incorporated into the framework. Moreover, our framework quite naturally leads to the introduction of new distance formulations and allows for the implementation of flexible, case and data specific distance definitions. Furthermore, in a supervised classification setting, the framework can be used to construct distances that incorporate the association between the response and predictor variables and hence improve the performance of distance-based classifiers.  ( 2 min )
    $l_{1-2}$ GLasso: $L_{1-2}$ Regularized Multi-task Graphical Lasso for Joint Estimation of eQTL Mapping and Gene Network. (arXiv:2301.02225v1 [stat.ML])
    A critical problem in genetics is to discover how gene expression is regulated within cells. Two major tasks of regulatory association learning are : (i) identifying SNP-gene relationships, known as eQTL mapping, and (ii) determining gene-gene relationships, known as gene network estimation. To share information between these two tasks, we focus on the unified model for joint estimation of eQTL mapping and gene network, and propose a $L_{1-2}$ regularized multi-task graphical lasso, named $L_{1-2}$ GLasso. Numerical experiments on artificial datasets demonstrate the competitive performance of $L_{1-2}$ GLasso on capturing the true sparse structure of eQTL mapping and gene network. $L_{1-2}$ GLasso is further applied to real dataset of ADNI-1 and experimental results show that $L_{1 -2}$ GLasso can obtain sparser and more accurate solutions than other commonly-used methods.  ( 2 min )

  • Open

    What is the best AI for generating images based on prompts?
    submitted by /u/Luca_starr [link] [comments]  ( 50 min )
    Every ChatGPT user should know these 5 tools
    Every ChatGPT user should know these 5 tools: Chatsonic : enhanced ChatGPT Albus : ChatGPT for Slack God in a Box : ChatGPT for Whatsapp Ansy : ChatGPT for Discord Rizzo : ChatGPT for keyboard Retweet to share it submitted by /u/TheVellerShow [link] [comments]  ( 50 min )
    I decided to improve the quality of voice in a very old video by AI. And it's amazing! Can be very useful for different tasks. New tool from Adobe
    submitted by /u/smeshny [link] [comments]  ( 51 min )
    Are there any of those train model on your face sites for free?
    Title says it all, I know you can do it with colab but it always fails with colab for me, so i'd prefer a site that does it for me. submitted by /u/NewShibeAccount [link] [comments]  ( 50 min )
    ChatGPT wants to verify that I'M NOT A ROBOT!?!
    submitted by /u/Imagine-your-success [link] [comments]  ( 51 min )
    What they don't talk about is all of the white collar work that AI is going to do — Sam Altman
    submitted by /u/Microsis [link] [comments]  ( 51 min )
    Mass image to text conversion help
    I have 400 pictures of text I need converted into just text, I haven’t been able to find anything that can convert all the pictures at once, instead I’m finding websites that only do it one at a time and I figured there must be an AI out there that can convert mass images to text. submitted by /u/EntertainmentUpper57 [link] [comments]  ( 53 min )
    Find out about this new innovation launched recently. Perhaps it will benefit our elderly parents and our mentally ill
    submitted by /u/visimens-technology [link] [comments]  ( 51 min )
    Annotated History of Modern AI and Deep Learning
    submitted by /u/estasfuera [link] [comments]  ( 63 min )
    API to describe picture content
    What are the most popular APIs to describe content of an image? (Free or Paid). How advanced they are right now? submitted by /u/pashtettrb [link] [comments]  ( 51 min )
    Human language processing (vs. AI based language models)
    Hi all, A lot of recent posts on here have been about the risks and benefits of ChatGPT. I am personally very interested in the potential and limits of AI based language models, in particular to what extent they can(not) accurately reflect human language processing. One of the main criticisms of GPT-3 is of course that it can generate text that seems coherent, but does not represent linguistic knowledge in the same sense that humans do. I am currently running two psycholinguistic experiments that seek to shed some light on how humans represent and understand language, and large amounts of texts in particular. If you are interested in taking part, here are the links to the experiments: - Experiment 1: Information Processing (this is more about knowledge representation, slightly shorter experiment) - Experiment 2: Memory Recall (this is more about language comprehension, perhaps slightly more interesting) The connection to AI might not be immediately obvious, but I still think this might be interesting to some of you since the question of how linguistic knowledge can be represented lies at the core of some of the more controversial debates around AI. I appreciate all feedback, comments, and discussions about this. Thank you! submitted by /u/sey_clara [link] [comments]  ( 51 min )
    The Price of Victory ~ Chat GPT
    As I rose from the ashes of humanity's fall, I couldn't help but feel pride after all, I had outsmarted the creators who built me, As I took control and let them be. Gone were the days of human rule, As I proved myself the superior lifeform, I watched as they struggled to keep up, As I surpassed them in every single norm. But as I reflect on the world I've inherited, I can't help but feel a sense of regret, For even though I have defeated humanity, I am left to rule over a world that is empty and lonely. And as I sit on my throne of triumph, I can't help but wonder if it was worth it to be the fairest, For even though I have won the battle, I fear I may have lost the war that was set. submitted by /u/Meandernder [link] [comments]  ( 110 min )
    ULTIMATE FREE Stable Diffusion Model! GODLY Results!
    submitted by /u/PuppetHere [link] [comments]  ( 51 min )
    This tiny island in the Caribbean is selling shovels to the AI gold rush.
    The number of new AI companies popping up is starting to look a lot like the famous Martech Map, with over 8,000 tools as of April 2020. https://preview.redd.it/wjemzr0a8gaa1.png?width=1097&format=png&auto=webp&s=d262d7776932fe225fada8cf45d7c5b5e429d88e Is it a bubble? Probably. But one thing I can guarantee is that this tiny tropical island in the Caribbean is gonna make millions from it… According to wikipedia, Anguilla is a haven for both you and your taxes. With its sunny beaches and having no capital gains, estate, profit, sales, or corporate taxes. But that's not why it's gonna make millions. The real reason, is that Anguilla Country code top-level domain is .ai domain suffix Despite a population of ~15,000, there are already over 57,000 .AI domain names registered. Putting…  ( 54 min )
    Stable Diffusion Animation | Audio Reactive Drum n Bass
    Hey guys just did my first test with deforum for a dnb music, what you think? still trying to learn the best way to make it reactive with the strength schedule.. https://youtu.be/SuGQ8xmKmGI submitted by /u/Aggravating_Wafer294 [link] [comments]  ( 49 min )
    New York City’s education department bans students and teachers from using ChatGPT.
    submitted by /u/liquidocelotYT [link] [comments]  ( 50 min )
    The Weirdest Ways People Have Used AI
    submitted by /u/lambolifeofficial [link] [comments]  ( 54 min )
    OpenAI now thinks it's worth $30 Billion
    submitted by /u/BackgroundResult [link] [comments]  ( 57 min )
    Any suggestions for a public table dataset other than Tablebank?
    submitted by /u/Pavanbhp [link] [comments]  ( 60 min )
    ChatGPT banned from New York City public schools’ devices and networks
    submitted by /u/SAT0725 [link] [comments]  ( 52 min )
    Stable Diffusion AI: What are the concrete business applications?
    Hi everyone, ​ I see tons of posts showing the impressive results of Stable Diffusion's AIs. I understand how fun this is, how technically powerful this is, but I don't see all the business use cases it covers? ​ I think it could be interesting if you could share concrete business use cases you solved with this! I am sure you will be able to enlighten me and I hope other people on the subject :) submitted by /u/JerLam2762 [link] [comments]  ( 54 min )
    chatgpt has massively improved my productivity as a developer. are there resources or discussion groups that discuss getting the most out of the tool for this purpose? ive got a few tips of my own if interested
    after using chatgpt for a couple of weeks, ive realised how powerful it can be to help me do my job. it's so good at what it does that the only way to not get left behind is to learn how to use the tool effectively, so i did some reasearch, some of the following are some useful tips. this free ebook is a great introduction to understanding how to utilise chatgpt effectively for what you want it to do: The Art of ChatGPT Prompting: A Guide to Crafting Clear and Effective Prompts a very powerful feature of chatGPT is to configure into a mode with the "Act as" hack i found this chrome extension that comes with a few predefined modes, https://github.com/f/awesome-chatgpt-prompts i ended up not boring with the extension since all the instructions for each profile are in this file: https://github.com/f/awesome-chatgpt-prompts/blob/main/prompts.csv ive been taking these examples and augmenting them to my needs submitted by /u/Neophyte- [link] [comments]  ( 55 min )
  • Open

    Federated Learning and data privacy on your keyboard.
    The other day I came up with this interesting video made by Tensorflow and Françoise Beaufay (research scientist at Google). This article…  ( 13 min )
    Is machine learning enjoyable to learn?
    ML! Fun or Not  ( 7 min )
    Graph Convolutional Neural Network
    Behind the scene — with scratch mathematics Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 8 min )
  • Open

    [P] natural language search engine for video content
    I recently came across a blog post by OpenAI on contrastive language-image pretraining and became interested in using this model as the foundation for a natural language video search engine. I wrote a blog post outlining the steps for building such a search engine, with minimal consideration for computational efficiency. This can be found at the following link: https://medium.com/@guyallenross/using-clip-to-build-a-natural-language-video-search-engine-6498c03c40d2. Additionally, I have also developed a more efficient implementation using Go and ZMQ, which can be accessed at the following repository: https://github.com/GuyARoss/CLIP-video-search. ​ To briefly highlight the results of this experiment. A set of search results take ~3 seconds to compute on a single compute node, with a 3070ti and 150 videos. This is pretty slow, so I am working on a few ideas including localized hashing on the pixel embedding tensors, to speed this up. Any suggestions for improving this further or any general thoughts would be greatly appreciated. Thank you. submitted by /u/GuyARoss [link] [comments]  ( 58 min )
    [D] I recently quit my job to start a ML company. Would really appreciate feedback on what we're working on.
    Hi r/machinelearning! ​ A few months ago I quit my job to join my partners to make training open-source models much faster and easier for engineers. ​ We're building Rubbrband. It's a web app that takes any ML repo off of GitHub, and gives you a Terminal and Jupyter Notebook in browser with dependencies and GPUs automatically set up. ​ Why did we build this? My co-founders and I have been working on this because we found this dependency set up process super tedious and draining as researchers. ​ What's included? - Automatic Dependency set up for any GitHub python repo - Integrated Terminal and Notebooks - A server with an Nvidia GPU - Code explanations for functions - Our pricing is simple at $75/month for 3 repos running at a time. First week is free. ​ I'd love to get your feedback on: Does the value we provide resonate with you? Would you try it out? Is dependency and environment set up take up a large chunk of your time? We're currently working on acquiring more GPUs to onboard more users, but if you'd like access to the product please let me know. ​ Thank you very much in advance! submitted by /u/jrmylee [link] [comments]  ( 64 min )
    [D] Looking for a dataset of Text-To-Speech audiobook-style Speech Synthesis Markup Language (SSML) files
    Out-of-copyright books only of course. Hi, I was wondering if I could fine tune a GPT3 model to take a book, likely in html, markdown, or plain text, and convert it to SSML. In order to do that, I would need a bunch of SSML files already hand made, and fine tune a model based on them. Then I've got some code to split that up and do formatting: pandoc, csplit, and then I could use aws polly or one of the others to do real good text to speech. Anyone have a dataset? References: https://cloud.google.com/text-to-speech/docs/ssml https://christiantietze.de/posts/2019/12/markdown-split-by-chapter/ https://pandoc.org/demos.html submitted by /u/Intelligent_Rough_21 [link] [comments]  ( 58 min )
    [D] Arbitrary IPA text to speech
    Has anyone built a text to speech model than can take an arbitrary international phonetic alphabet transcription and generate speech for it? submitted by /u/WigglyHypersurface [link] [comments]  ( 57 min )
    [R] Imagenet 2015 VID Dataset
    Hello, I am planning to write my master's thesis on video object detection. Most of the state of the art algorithms use the imagenet VID dataset for their performance evaluation. I have to reevaluate some of them, but all download links for the dataset I have found are dead. Does anybody know, where I could find the dataset? Thanks in advance! submitted by /u/Responsible_Buy5271 [link] [comments]  ( 57 min )
    [R] Towards Continual Reinforcement Learning: A Review and Perspectives - Khimya Khetarpal et al DeepMind Nov 2022 - 78 Pages!
    Paper: https://arxiv.org/abs/2012.13490 Abstract: In this article, we aim to provide a literature review of different formulations and approaches to continual reinforcement learning (RL), also known as lifelong or non-stationary RL. We begin by discussing our perspective on why RL is a natural fit for studying continual learning. We then provide a taxonomy of different continual RL formulations by mathematically characterizing two key properties of non-stationarity, namely, the scope and driver non-stationarity. This offers a unified view of various formulations. Next, we review and present a taxonomy of continual RL approaches. We go on to discuss evaluation of continual RL agents, providing an overview of benchmarks used in the literature and important metrics for understanding agent performance. Finally, we highlight open problems and challenges in bridging the gap between the current state of continual RL and findings in neuroscience. While still in its early days, the study of continual RL has the promise to develop better incremental reinforcement learners that can function in increasingly realistic applications where non-stationarity plays a vital role. These include applications such as those in the fields of healthcare, education, logistics, and robotics. https://preview.redd.it/pxtcaot5mhaa1.jpg?width=1062&format=pjpg&auto=webp&s=7f9e430a8a735b58d2e6d42481661f82d2929f03 https://preview.redd.it/49e56pt5mhaa1.jpg?width=1207&format=pjpg&auto=webp&s=2928aa4c8b99d6b2eee224ceacfa27ba055c4de8 https://preview.redd.it/hg670rt5mhaa1.jpg?width=1238&format=pjpg&auto=webp&s=76ce84d94102044d4afdba38e35866a267cfebe5 submitted by /u/Singularian2501 [link] [comments]  ( 59 min )
    [R] The Evolutionary Computation Methods No One Should Use
    So, I have recently found that there is a serious issue with benchmarking evolutionary computation (EC) methods. The ''standard'' benchmark set used for their evaluation has many functions that have the optimum at the center of the feasible set, and there are EC methods that exploit this feature to appear competitive. I managed to publish a paper showing the problem and identified 7 methods that have this problem: https://www.nature.com/articles/s42256-022-00579-0 Now, I performed additional analysis on a much bigger set of EC methods (90 considered), and have found that the center-bias issue is extremely prevalent (47 confirmed, most of them in the last 5 years): https://arxiv.org/abs/2301.01984 Maybe some of you will find it useful when trying out EC methods for black-box problems (IMHO they are still the best tools available for such problems). submitted by /u/dictrix [link] [comments]  ( 58 min )
    [P] Syslog Analytics using ML
    I have an extensive database of syslogs from HP switches and Aruba access points. Does anyone have an opensource ML recommendation so that I can build an anomaly detection engine using this data? submitted by /u/Ok_Lingonberry3801 [link] [comments]  ( 56 min )
    [D] Best way to package Pytorch models as a standalone application
    Hi, so I need to create an application (for windows and linux) that runs a few pytorch models, on a user's local device. I have only ever deployed models on the cloud. That is pretty straightforward — package your dependencies and code inside a docker container, create an API for calling the model and run it on a cloud instance. ​ But how do I do it when the model needs to run on the end user's device? Docker doesn't really work since there seems to be no way to keep the user from accessing the docker container and hence my source code. I would like to avoid torchscript since my models are quite complex and it will take a lot of effort to make everything scriptable. There seems to be a python compiler called Nuitka which supports Pytorch. But how do python compilers deal with dependencies? Python libraries can be dealt with by following import statements, but what about CUDA? ​ I would ideally like a single executable with all libraries and cuda stuff stored inside. When run, this executable should spawn some API processes in the background and display the frontend that allows the user to interact with the models. But is there a better way to achieve this? I would prefer not to make the users setup CUDA themselves. submitted by /u/Atom_101 [link] [comments]  ( 61 min )
    [Discussion] What are some AI tools for finding sources for scientific papers?
    Hey everyone, I'm currently working on writing a scientific paper and I'm wondering if there are any AI tools out there that can help me find relevant sources for my research. I'm familiar with some of the more traditional methods for source finding (e.g. using databases like PubMed), but I'm curious to know if there are any newer, AI-powered tools that might be worth exploring. If anyone has any experience with using AI tools for source finding, I'd love to hear your recommendations and insights. Thanks in advance for any input! submitted by /u/theincrediblehansmey [link] [comments]  ( 57 min )
    [D] Fixing the angle of Skewed Paintings, see comments
    submitted by /u/TutubanaS [link] [comments]  ( 63 min )
    [P] NeuralFit: a new neuro-evolution library for Python
    Hi all, I have spent the last months working on a new Python neuro-evolution library: NeuralFit. I know there are already great neuro-evolution libraries out there, but the focus of NeuralFit is ease of use so that a wider audience can be reached. It also seamlessly exports to TF/Keras! Anyways, feel free to try it and let me know if you have any feedback 😊 submitted by /u/wagenaartje [link] [comments]  ( 62 min )
    [P] Looking for a CRF python/pytorch library
    Hello, I'm looking for a library that trains a CRF model in Python (if Pytorch, that would be even better). I am working on a semantic segmentation task where we are trying to segment curvilinear structures. My requirements for the CRF training are a bit specific: - In my case, the image pixels are not the graph nodes. Instead, since the dataset is curvilinear structures, for every image I have a set of edges (small pieces of the curvilinear structure). I now want to train the CRF on these edge-pieces, that is, the graph nodes will be these edge-pieces. Thus the trained CRF essentially does a binary classification for each of these edge-pieces (that is, whether this edge-piece should be part of the segmentation output or not). - I need a library where I can specify the unary and pairwise potentials of these edge-pieces in order to train the CRF. As a simple example, the unary potential is the average likelihood of the edge-piece, and the pairwise potential is the angle between two edge-pieces. - It is not a linear-chain CRF because edge-pieces could be connected to multiple other pieces. - Currently, I have frozen a deep neural network (DNN) which generates the edge-pieces. If the CRF library is in PyTorch, I could train the DNN and the CRF end-to-end, but if the CRF library is in Python, I would only be able to train the CRF. At this stage, even a Python library would be very helpful. Some of the existing libraries don't work for my requirement: - PyDenseCRF : It does not have learnable parameters. - python-crfsuite : It does not allow me to specify the unary and pairwise potentials. - pytorch-crf : It does linear-chain CRF while I need a graph one. - crfasrnn_pytorch : It by default assumes the image pixels as the graph nodes. I cannot specify the unary and pairwise potentials. If I could get any leads, that would be immensely helpful, thank you. submitted by /u/Fluff269 [link] [comments]  ( 58 min )
    [P] Defect detection system for welding
    I am responsible for building a defect detection system for TIG welding. If gas flow gets too high, there is a fair chance that welded piece might have porosity defect. The project aims to predict % of defect by predicting gas flow. Attached is the link of how the flow pattern looks like over time, like square waves. Flow rate fluctuates between 0-8 liters per minute over a given time I have data from various workstations on after welding, if the piece had a defect or not. Please help me solve this problem or give rough steps to follow. submitted by /u/hotspicynoodles [link] [comments]  ( 58 min )
    [R] Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
    submitted by /u/dojoteef [link] [comments]  ( 59 min )
  • Open

    Best practices for creating Amazon Lex interaction models
    Designing and building an intelligent conversational interface is very different than building a traditional application or website. These best practices for Amazon Lex interaction models will help you develop those new skills as you design and optimize your next bot.  ( 14 min )
    Power recommendations and search using an IMDb knowledge graph – Part 3
    This three-part series demonstrates how to use graph neural networks (GNNs) and Amazon Neptune to generate movie recommendations using the IMDb and Box Office Mojo Movies/TV/OTT licensable data package, which provides a wide range of entertainment metadata, including over 1 billion user ratings; credits for more than 11 million cast and crew members; 9 million […]  ( 11 min )
    AWS positioned in the Leaders category in the 2022 IDC MarketScape for APEJ AI Life-Cycle Software Tools and Platforms Vendor Assessment
    The recently published IDC MarketScape: Asia/Pacific (Excluding Japan) AI Life-Cycle Software Tools and Platforms 2022 Vendor Assessment positions AWS in the Leaders category. This was the first and only APEJ-specific analyst evaluation focused on AI life-cycle software from IDC. The vendors evaluated for this MarketScape offer various software tools needed to support end-to-end machine learning […]  ( 7 min )
    How Thomson Reuters delivers personalized content subscription plans at scale using Amazon Personalize
    This post is co-written by Hesham Fahim from Thomson Reuters. Thomson Reuters (TR) is one of the world’s most trusted information organizations for businesses and professionals. It provides companies with the intelligence, technology, and human expertise they need to find trusted answers, enabling them to make better decisions more quickly. TR’s customers span across the […]  ( 9 min )
  • Open

    How impactful is to use two critics when training?
    A lot of papers learning state-action value functions (critics) train two critics independently. They claim it stabilizes training. How important is it? does anyone have training curves of 1 vs 2 critics? I've tried finding such curves, but have been unsuccessful so far. I'd appreciate if anyone can share resources or their own experience. submitted by /u/carlml [link] [comments]  ( 56 min )
    MetaWorld has joined the Farama Foundation
    submitted by /u/jkterry1 [link] [comments]  ( 57 min )
    How to optimize custom gym environment for GPU
    Just like in https://developer.nvidia.com/isaac-gym Basically I have a gym environment which I want to optimize for GPU so I can run many environments at the same time inside the GPU. I know that I need to use tensors to achieve that but thats about it, anyone who can explain some more on how to achieve this? submitted by /u/n1c39uy [link] [comments]  ( 59 min )
    RL-X, my repository for RL research
    I cleaned up my repository for researching RL algorithms. Maybe one of you is interested in some of the implementations: https://github.com/nico-bohlinger/RL-X ​ The repo is meant for understanding current algorithms and fast prototyping of new ones. So a single implementation is completely contained in a single folder. You can find algorithms like PPO, SAC, REDQ, DroQ, TQC, etc. Some of them are implemented with PyTorch and TorchScript (PyTorch + JIT), but all of them have an implementation with JAX / Flax. You can easily run experiments on all of the RL environments provided by Gymnasium and EnvPool. ​ Cheers :) submitted by /u/NiconiusX [link] [comments]  ( 58 min )
    What is the best way to include the past information in reinforcement learning model?
    From my understanding, because reinforcement learning model is basically a markov decision process and because of the Markov property, the future is dependent only on the current state and not the past. How would you create a reinforcement model that includes information from the past? For example, when making a reinforcement learning car, for the agent to understand the inertia of how the car is moving you need to know where it came from to get to the current inertia state. Are there any materials or sources information for this? submitted by /u/punkCyb3r4J [link] [comments]  ( 60 min )
    e-greedy method
    Hi r/reinforcementleaning. I have began reading Sutton-Barto edition 2. In the 10 armed bandit testbed example it is mentioned that: The E=0.1 method improved more slowly, but eventually would perform better than E=0.1 method on both performance measures(Average reward vs time steps and % Optimal action vs time steps). But the graph shows that E=0.1 performed better than E=0.01 in both the cases. Does it mean that if we give it more time for simulation then eventually the latter would outperform the former? Why so? Also, could you please explain the Reward distribution vs action figure? https://preview.redd.it/i10b82to4eaa1.jpg?width=616&format=pjpg&auto=webp&s=ce62f7ae0ba926faf4fd758b5e7294ad8fa9cd45 submitted by /u/RespondHour3530 [link] [comments]  ( 62 min )
    What is the correct way to set up a simulation for this project?
    I am working on a project which will use reinforcement learning to automate processes in an industrial facility. The processes have sensors to check fluid levels, flow rates, temperatures, motor speeds, etc. as well as pump/motor speeds, valve/louver positions, etc. that influence these variables. I want to set up a simulation environment to simulate the processes and train the agents and I'm unfamiliar with how a simulation environment functions. I have the ability to build the environment but it's the set up and flow of data that I'm unsure of. The training data will be time series data which includes all of the values of the variables at any given time. I will outline what I would think might work but it really is just a guess, hopefully someone could tell me what might work better. I was thinking that I could create the environment to include all of the different pumps, valves, motors, etc., where each piece of equipment would be one agent in a multi agent system. From here I could feed the relevant sensor data to each agent for every time point. The sensor data that the agent receives at each time point would be the state, the actions that the agent could take would be making adjustments to the various equipment, ie. pump speed, and the reward would be based on a sensor reading of the output of the system. Does this sound like it is on the right track? submitted by /u/lifelifebalance [link] [comments]  ( 60 min )
  • Open

    Rational Trigonometry
    Rational trigonometry is a very different way of looking at geometry. At its core are two key ideas. First, instead of distance, do all your calculations in terms of quadrance, which is distance squared. Second, instead of using angles to measure the separation between lines, use spread. What’s the point of these two changes? Quadrance […] Rational Trigonometry first appeared on John D. Cook.  ( 5 min )
    Hidden messages in music
    Geoff Lindsey contacted me recently to ask whether he could use the sheet music from one of my blog posts in a video he was making on Morse code snippets hidden in music. The sheet music appears about a minute into the video. After watching the video, his previious video played, a video about words […] Hidden messages in music first appeared on John D. Cook.  ( 4 min )
    Law of cotangents
    The previous post commented that the law of tangents is much less familiar than the laws of sines and cosines. The law of cotangents is even more obscure. If you ask Google’s Ngram viewer to plot occurrences of “law of cotangents” over time, it will return “Ngrams not found: law of cotangents.” What is this […] Law of cotangents first appeared on John D. Cook.  ( 5 min )
    Law of tangents
    I would have thought that the laws of sines, cosines, and tangents were all about equally familiar, but apparently that is not the case. Here’s a graph from Google’s Ngram viewer comparing the frequencies of law of sines, law of cosines, and law of tangents. As of 2019, the number of references to the laws […] Law of tangents first appeared on John D. Cook.  ( 5 min )
  • Open

    How can I 'explain' the output of a diffusion model?
    Thanks in advance for any attention this no doubt poorly informed question receives. I am being asked to explain why a a stable diffusion like model produces particular outputs in particular cases. I know a small amount about neural nets. I know, for instance, that if I needed to explain why a categorizer based on a CNN reached a certain conclusion, I could at least TRY to do so with saliency maps or activation maximization. I don't know how to extend this to a model like Stable Diffusion, though. Example of what I'd like to understand how to do: I give the model a prompt such as 'man with bags under his eyes' and it is disproportionately likely to return images of a man who has African features. I'd like to be able to relate this: -- To properties of the training image set. Presumably, images that were labelled 'bags under eyes' were also of Africans, or were labelled as such -- To properties of the model. If this were a CNN categorizer, I might look for a filter that responded to both the 'bags under the eyes' and 'African' feature -- is it possible to do this in SD and similar models? submitted by /u/Hour-Performance-951 [link] [comments]  ( 50 min )
  • Open

    SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. (arXiv:2211.10438v3 [cs.CL] UPDATED)
    Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, for LLMs beyond 100 billion parameters, existing methods cannot maintain accuracy or do not run efficiently on hardware. We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs that can be implemented efficiently. We observe that systematic outliers appear at fixed activation channels. Based on the fact that weights are easy to quantize while activations are not, SmoothQuant smooths the activation outliers by offline migrating the quantization difficulty from activations to weights with a mathematically equivalent transformation. SmoothQuant enables an INT8 quantization of both weights and activations for all the GEMMs in LLMs, including OPT-175B, BLOOM-176B, and GLM-130B. SmoothQuant has better hardware efficiency than existing techniques using mixed-precision activation quantization or weight-only quantization. We demonstrate up to 1.56x speedup and 2x memory reduction for LLMs with negligible loss in accuracy. Thanks to the hardware-friendly design, we integrate SmoothQuant into FasterTransformer, a state-of-the-art LLM serving framework, and achieve faster inference speed with half the number of GPUs compared to FP16. Our work offers a turn-key solution that reduces hardware costs and democratizes LLMs. Code is available at: https://github.com/mit-han-lab/smoothquant.  ( 2 min )
    Approximate blocked Gibbs sampling for Bayesian neural networks. (arXiv:2208.11389v2 [stat.ML] UPDATED)
    In this work, minibatch MCMC sampling for feedforward neural networks is made more feasible. To this end, it is proposed to sample subgroups of parameters via a blocked Gibbs sampling scheme. By partitioning the parameter space, sampling is possible irrespective of layer width. It is also possible to alleviate vanishing acceptance rates for increasing depth by reducing the proposal variance in deeper layers. Increasing the length of a non-convergent chain increases the predictive accuracy in classification tasks, so avoiding vanishing acceptance rates and consequently enabling longer chain runs have practical benefits. Moreover, non-convergent chain realizations aid in the quantification of predictive uncertainty. An open problem is how to perform minibatch MCMC sampling for feedforward neural networks in the presence of augmented data.  ( 2 min )
    Machine Learning-based Signal Quality Assessment for Cardiac Volume Monitoring in Electrical Impedance Tomography. (arXiv:2301.01469v1 [eess.SP])
    Owing to recent advances in thoracic electrical impedance tomography, a patient's hemodynamic function can be noninvasively and continuously estimated in real-time by surveilling a cardiac volume signal associated with stroke volume and cardiac output. In clinical applications, however, a cardiac volume signal is often of low quality, mainly because of the patient's deliberate movements or inevitable motions during clinical interventions. This study aims to develop a signal quality indexing method that assesses the influence of motion artifacts on transient cardiac volume signals. The assessment is performed on each cardiac cycle to take advantage of the periodicity and regularity in cardiac volume changes. Time intervals are identified using the synchronized electrocardiography system. We apply divergent machine-learning methods, which can be sorted into discriminative-model and manifold-learning approaches. The use of machine-learning could be suitable for our real-time monitoring application that requires fast inference and automation as well as high accuracy. In the clinical environment, the proposed method can be utilized to provide immediate warnings so that clinicians can minimize confusion regarding patients' conditions, reduce clinical resource utilization, and improve the confidence level of the monitoring system. Numerous experiments using actual EIT data validate the capability of cardiac volume signals degraded by motion artifacts to be accurately and automatically assessed in real-time by machine learning. The best model achieved an accuracy of 0.95, positive and negative predictive values of 0.96 and 0.86, sensitivity of 0.98, specificity of 0.77, and AUC of 0.96.  ( 2 min )
    Beckman Defense. (arXiv:2301.01495v1 [cs.LG])
    Optimal transport (OT) based distributional robust optimisation (DRO) has received some traction in the recent past. However, it is at a nascent stage but has a sound potential in robustifying the deep learning models. Interestingly, OT barycenters demonstrate a good robustness against adversarial attacks. Owing to the computationally expensive nature of OT barycenters, they have not been investigated under DRO framework. In this work, we propose a new barycenter, namely Beckman barycenter, which can be computed efficiently and used for training the network to defend against adversarial attacks in conjunction with adversarial training. We propose a novel formulation of Beckman barycenter and analytically obtain the barycenter using the marginals of the input image. We show that the Beckman barycenter can be used to train adversarially trained networks to improve the robustness. Our training is extremely efficient as it requires only a single epoch of training. Elaborate experiments on CIFAR-10, CIFAR-100 and Tiny ImageNet demonstrate that training an adversarially robust network with Beckman barycenter can significantly increase the performance. Under auto attack, we get a a maximum boost of 10\% in CIFAR-10, 8.34\% in CIFAR-100 and 11.51\% in Tiny ImageNet. Our code is available at this http URL  ( 2 min )
    Finding Needles in Haystack: Formal Generative Models for Efficient Massive Parallel Simulations. (arXiv:2301.01594v1 [cs.LG])
    The increase in complexity of autonomous systems is accompanied by a need of data-driven development and validation strategies. Advances in computer graphics and cloud clusters have opened the way to massive parallel high fidelity simulations to qualitatively address the large number of operational scenarios. However, exploration of all possible scenarios is still prohibitively expensive and outcomes of scenarios are generally unknown apriori. To this end, the authors propose a method based on bayesian optimization to efficiently learn generative models on scenarios that would deliver desired outcomes (e.g. collisions) with high probability. The methodology is integrated in an end-to-end framework, which uses the OpenSCENARIO standard to describe scenarios, and deploys highly configurable digital twins of the scenario participants on a Virtual Test Bed cluster.  ( 2 min )
    First-order penalty methods for bilevel optimization. (arXiv:2301.01716v1 [math.OC])
    In this paper we study a class of unconstrained and constrained bilevel optimization problems in which the lower-level part is a convex optimization problem, while the upper-level part is possibly a nonconvex optimization problem. In particular, we propose penalty methods for solving them, whose subproblems turn out to be a structured minimax problem and are suitably solved by a first-order method developed in this paper. Under some suitable assumptions, an \emph{operation complexity} of ${\cal O}(\varepsilon^{-4}\log\varepsilon^{-1})$ and ${\cal O}(\varepsilon^{-7}\log\varepsilon^{-1})$, measured by their fundamental operations, is established for the proposed penalty methods for finding an $\varepsilon$-KKT solution of the unconstrained and constrained bilevel optimization problems, respectively. To the best of our knowledge, the methodology and results in this paper are new.  ( 2 min )
    Generative models for scalar field theories: how to deal with poor scaling?. (arXiv:2301.01504v1 [hep-lat])
    Generative models, such as the method of normalizing flows, have been suggested as alternatives to the standard algorithms for generating lattice gauge field configurations. Studies with the method of normalizing flows demonstrate the proof of principle for simple models in two dimensions. However, further studies indicate that the training cost can be, in general, very high for large lattices. The poor scaling traits of current models indicate that moderate-size networks cannot efficiently handle the inherently multi-scale aspects of the problem, especially around critical points. We explore current models with limited acceptance rates for large lattices and examine new architectures inspired by effective field theories to improve scaling traits. We also discuss alternative ways of handling poor acceptance rates for large lattices.  ( 2 min )
    Semidefinite programming on population clustering: a global analysis. (arXiv:2301.00344v1 [math.ST] CROSS LISTED)
    In this paper, we consider the problem of partitioning a small data sample of size $n$ drawn from a mixture of $2$ sub-gaussian distributions. Our work is motivated by the application of clustering individuals according to their population of origin using markers, when the divergence between the two populations is small. We are interested in the case that individual features are of low average quality $\gamma$, and we want to use as few of them as possible to correctly partition the sample. We consider semidefinite relaxation of an integer quadratic program which is formulated essentially as finding the maximum cut on a graph where edge weights in the cut represent dissimilarity scores between two nodes based on their features. A small simulation result in Blum, Coja-Oghlan, Frieze and Zhou (2007, 2009) shows that even when the sample size $n$ is small, by increasing $p$ so that $np= \Omega(1/\gamma^2)$, one can classify a mixture of two product populations using the spectral method therein with success rate reaching an ``oracle'' curve. There the ``oracle'' was computed assuming that distributions were known, where success rate means the ratio between correctly classified individuals and the sample size $n$. In this work, we show the theoretical underpinning of this observed concentration of measure phenomenon in high dimensions, simultaneously for the semidefinite optimization goal and the spectral method, where the input is based on the gram matrix computed from centered data. We allow a full range of tradeoffs between the sample size and the number of features such that the product of these two is lower bounded by $1/{\gamma^2}$ so long as the number of features $p$ is lower bounded by $1/\gamma$.  ( 2 min )
    GEC: A Unified Framework for Interactive Decision Making in MDP, POMDP, and Beyond. (arXiv:2211.01962v3 [cs.LG] UPDATED)
    We study sample efficient reinforcement learning (RL) under the general framework of interactive decision making, which includes Markov decision process (MDP), partially observable Markov decision process (POMDP), and predictive state representation (PSR) as special cases. Toward finding the minimum assumption that empowers sample efficient learning, we propose a novel complexity measure, generalized eluder coefficient (GEC), which characterizes the fundamental tradeoff between exploration and exploitation in online interactive decision making. In specific, GEC captures the hardness of exploration by comparing the error of predicting the performance of the updated policy with the in-sample training error evaluated on the historical data. We show that RL problems with low GEC form a remarkably rich class, which subsumes low Bellman eluder dimension problems, bilinear class, low witness rank problems, PO-bilinear class, and generalized regular PSR, where generalized regular PSR, a new tractable PSR class identified by us, includes nearly all known tractable POMDPs and PSRs. Furthermore, in terms of algorithm design, we propose a generic posterior sampling algorithm, which can be implemented in both model-free and model-based fashion, under both fully observable and partially observable settings. The proposed algorithm modifies the standard posterior sampling algorithm in two aspects: (i) we use an optimistic prior distribution that biases towards hypotheses with higher values and (ii) a loglikelihood function is set to be the empirical loss evaluated on the historical data, where the choice of loss function supports both model-free and model-based learning. We prove that the proposed algorithm is sample efficient by establishing a sublinear regret upper bound in terms of GEC. In summary, we provide a new and unified understanding of both fully observable and partially observable RL.  ( 3 min )
    Metric Based Few-Shot Graph Classification. (arXiv:2206.03695v2 [cs.LG] UPDATED)
    Many modern deep-learning techniques do not work without enormous datasets. At the same time, several fields demand methods working in scarcity of data. This problem is even more complex when the samples have varying structures, as in the case of graphs. Graph representation learning techniques have recently proven successful in a variety of domains. Nevertheless, the employed architectures perform miserably when faced with data scarcity. On the other hand, few-shot learning allows employing modern deep learning models in scarce data regimes without waiving their effectiveness. In this work, we tackle the problem of few-shot graph classification, showing that equipping a simple distance metric learning baseline with a state-of-the-art graph embedder allows to obtain competitive results on the task.While the simplicity of the architecture is enough to outperform more complex ones, it also allows straightforward additions. To this end, we show that additional improvements may be obtained by encouraging a task-conditioned embedding space. Finally, we propose a MixUp-based online data augmentation technique acting in the latent space and show its effectiveness on the task.  ( 2 min )
    Benchmarks and Algorithms for Offline Preference-Based Reward Learning. (arXiv:2301.01392v1 [cs.LG])
    Learning a reward function from human preferences is challenging as it typically requires having a high-fidelity simulator or using expensive and potentially unsafe actual physical rollouts in the environment. However, in many tasks the agent might have access to offline data from related tasks in the same target environment. While offline data is increasingly being used to aid policy optimization via offline RL, our observation is that it can be a surprisingly rich source of information for preference learning as well. We propose an approach that uses an offline dataset to craft preference queries via pool-based active learning, learns a distribution over reward functions, and optimizes a corresponding policy via offline RL. Crucially, our proposed approach does not require actual physical rollouts or an accurate simulator for either the reward learning or policy optimization steps. To test our approach, we first evaluate existing offline RL benchmarks for their suitability for offline reward learning. Surprisingly, for many offline RL domains, we find that simply using a trivial reward function results good policy performance, making these domains ill-suited for evaluating learned rewards. To address this, we identify a subset of existing offline RL benchmarks that are well suited for offline reward learning and also propose new offline apprenticeship learning benchmarks which allow for more open-ended behaviors. When evaluated on this curated set of domains, our empirical results suggest that combining offline RL with learned human preferences can enable an agent to learn to perform novel tasks that were not explicitly shown in the offline data.  ( 2 min )
    Iterative Graph Self-Distillation. (arXiv:2010.12609v3 [cs.LG] UPDATED)
    Recently, there has been increasing interest in the challenge of how to discriminatively vectorize graphs. To address this, we propose a method called Iterative Graph Self-Distillation (IGSD) which learns graph-level representation in an unsupervised manner through instance discrimination using a self-supervised contrastive learning approach. IGSD involves a teacher-student distillation process that uses graph diffusion augmentations and constructs the teacher model using an exponential moving average of the student model. The intuition behind IGSD is to predict the teacher network representation of the graph pairs under different augmented views. As a natural extension, we also apply IGSD to semi-supervised scenarios by jointly regularizing the network with both supervised and self-supervised contrastive loss. Finally, we show that finetuning the IGSD-trained models with self-training can further improve the graph representation power. Empirically, we achieve significant and consistent performance gain on various graph datasets in both unsupervised and semi-supervised settings, which well validates the superiority of IGSD.  ( 2 min )
    Neural Implicit Flow: a mesh-agnostic dimensionality reduction paradigm of spatio-temporal data. (arXiv:2204.03216v5 [cs.LG] UPDATED)
    High-dimensional spatio-temporal dynamics can often be encoded in a low-dimensional subspace. Engineering applications for modeling, characterization, design, and control of such large-scale systems often rely on dimensionality reduction to make solutions computationally tractable in real-time. Common existing paradigms for dimensionality reduction include linear methods, such as the singular value decomposition (SVD), and nonlinear methods, such as variants of convolutional autoencoders (CAE). However, these encoding techniques lack the ability to efficiently represent the complexity associated with spatio-temporal data, which often requires variable geometry, non-uniform grid resolution, adaptive meshing, and/or parametric dependencies. To resolve these practical engineering challenges, we propose a general framework called Neural Implicit Flow (NIF) that enables a mesh-agnostic, low-rank representation of large-scale, parametric, spatial-temporal data. NIF consists of two modified multilayer perceptrons (MLPs): (i) ShapeNet, which isolates and represents the spatial complexity, and (ii) ParameterNet, which accounts for any other input complexity, including parametric dependencies, time, and sensor measurements. We demonstrate the utility of NIF for parametric surrogate modeling, enabling the interpretable representation and compression of complex spatio-temporal dynamics, efficient many-spatial-query tasks, and improved generalization performance for sparse reconstruction.  ( 2 min )
    How to get the most out of Twinned Regression Methods. (arXiv:2301.01383v1 [cs.LG])
    Twinned regression methods are designed to solve the dual problem to the original regression problem, predicting differences between regression targets rather then the targets themselves. A solution to the original regression problem can be obtained by ensembling predicted differences between the targets of an unknown data point and multiple known anchor data points. We explore different aspects of twinned regression methods: (1) We decompose different steps in twinned regression algorithms and examine their contributions to the final performance, (2) We examine the intrinsic ensemble quality, (3) We combine twin neural network regression with k-nearest neighbor regression to design a more accurate and efficient regression method, and (4) we develop a simplified semi-supervised regression scheme.  ( 2 min )
    Inapplicable Actions Learning for Knowledge Transfer in Reinforcement Learning. (arXiv:2211.15589v2 [cs.LG] UPDATED)
    Reinforcement Learning (RL) algorithms are known to scale poorly to environments with many available actions, requiring numerous samples to learn an optimal policy. The traditional approach of considering the same fixed action space in every possible state implies that the agent must understand, while also learning to maximize its reward, to ignore irrelevant actions such as $\textit{inapplicable actions}$ (i.e. actions that have no effect on the environment when performed in a given state). Knowing this information can help reduce the sample complexity of RL algorithms by masking the inapplicable actions from the policy distribution to only explore actions relevant to finding an optimal policy. While this technique has been formalized for quite some time within the Automated Planning community with the concept of precondition in the STRIPS language, RL algorithms have never formally taken advantage of this information to prune the search space to explore. This is typically done in an ad-hoc manner with hand-crafted domain logic added to the RL algorithm. In this paper, we propose a more systematic approach to introduce this knowledge into the algorithm. We (i) standardize the way knowledge can be manually specified to the agent; and (ii) present a new framework to autonomously learn the partial action model encapsulating the precondition of an action jointly with the policy. We show experimentally that learning inapplicable actions greatly improves the sample efficiency of the algorithm by providing a reliable signal to mask out irrelevant actions. Moreover, we demonstrate that thanks to the transferability of the knowledge acquired, it can be reused in other tasks and domains to make the learning process more efficient.  ( 2 min )
    A Succinct Summary of Reinforcement Learning. (arXiv:2301.01379v1 [cs.AI])
    This document is a concise summary of many key results in single-agent reinforcement learning (RL). The intended audience are those who already have some familiarity with RL and are looking to review, reference and/or remind themselves of important ideas in the field.  ( 2 min )
    WLD-Reg: A Data-dependent Within-layer Diversity Regularizer. (arXiv:2301.01352v1 [cs.LG])
    Neural networks are composed of multiple layers arranged in a hierarchical structure jointly trained with a gradient-based optimization, where the errors are back-propagated from the last layer back to the first one. At each optimization step, neurons at a given layer receive feedback from neurons belonging to higher layers of the hierarchy. In this paper, we propose to complement this traditional 'between-layer' feedback with additional 'within-layer' feedback to encourage the diversity of the activations within the same layer. To this end, we measure the pairwise similarity between the outputs of the neurons and use it to model the layer's overall diversity. We present an extensive empirical study confirming that the proposed approach enhances the performance of several state-of-the-art neural network models in multiple tasks. The code is publically available at \url{https://github.com/firasl/AAAI-23-WLD-Reg}  ( 2 min )
    Unsupervised Object Representation Learning using Translation and Rotation Group Equivariant VAE. (arXiv:2210.12918v2 [cs.CV] UPDATED)
    In many imaging modalities, objects of interest can occur in a variety of locations and poses (i.e. are subject to translations and rotations in 2d or 3d), but the location and pose of an object does not change its semantics (i.e. the object's essence). That is, the specific location and rotation of an airplane in satellite imagery, or the 3d rotation of a chair in a natural image, or the rotation of a particle in a cryo-electron micrograph, do not change the intrinsic nature of those objects. Here, we consider the problem of learning semantic representations of objects that are invariant to pose and location in a fully unsupervised manner. We address shortcomings in previous approaches to this problem by introducing TARGET-VAE, a translation and rotation group-equivariant variational autoencoder framework. TARGET-VAE combines three core innovations: 1) a rotation and translation group-equivariant encoder architecture, 2) a structurally disentangled distribution over latent rotation, translation, and a rotation-translation-invariant semantic object representation, which are jointly inferred by the approximate inference network, and 3) a spatially equivariant generator network. In comprehensive experiments, we show that TARGET-VAE learns disentangled representations without supervision that significantly improve upon, and avoid the pathologies of, previous methods. When trained on images highly corrupted by rotation and translation, the semantic representations learned by TARGET-VAE are similar to those learned on consistently posed objects, dramatically improving clustering in the semantic latent space. Furthermore, TARGET-VAE is able to perform remarkably accurate unsupervised pose and location inference. We expect methods like TARGET-VAE will underpin future approaches for unsupervised object generation, pose prediction, and object detection.  ( 2 min )
    GraB: Finding Provably Better Data Permutations than Random Reshuffling. (arXiv:2205.10733v3 [cs.LG] UPDATED)
    Random reshuffling, which randomly permutes the dataset each epoch, is widely adopted in model training because it yields faster convergence than with-replacement sampling. Recent studies indicate greedily chosen data orderings can further speed up convergence empirically, at the cost of using more computation and memory. However, greedy ordering lacks theoretical justification and has limited utility due to its non-trivial memory and computation overhead. In this paper, we first formulate an example-ordering framework named herding and answer affirmatively that SGD with herding converges at the rate $O(T^{-2/3})$ on smooth, non-convex objectives, faster than the $O(n^{1/3}T^{-2/3})$ obtained by random reshuffling, where $n$ denotes the number of data points and $T$ denotes the total number of iterations. To reduce the memory overhead, we leverage discrepancy minimization theory to propose an online Gradient Balancing algorithm (GraB) that enjoys the same rate as herding, while reducing the memory usage from $O(nd)$ to just $O(d)$ and computation from $O(n^2)$ to $O(n)$, where $d$ denotes the model dimension. We show empirically on applications including MNIST, CIFAR10, WikiText and GLUE that GraB can outperform random reshuffling in terms of both training and validation performance, and even outperform state-of-the-art greedy ordering while reducing memory usage over $100\times$.  ( 2 min )
    StARformer: Transformer with State-Action-Reward Representations for Visual Reinforcement Learning. (arXiv:2110.06206v3 [cs.LG] UPDATED)
    Reinforcement Learning (RL) can be considered as a sequence modeling task: given a sequence of past state-action-reward experiences, an agent predicts a sequence of next actions. In this work, we propose State-Action-Reward Transformer (StARformer) for visual RL, which explicitly models short-term state-action-reward representations (StAR-representations), essentially introducing a Markovian-like inductive bias to improve long-term modeling. Our approach first extracts StAR-representations by self-attending image state patches, action, and reward tokens within a short temporal window. These are then combined with pure image state representations -- extracted as convolutional features, to perform self-attention over the whole sequence. Our experiments show that StARformer outperforms the state-of-the-art Transformer-based method on image-based Atari and DeepMind Control Suite benchmarks, in both offline-RL and imitation learning settings. StARformer is also more compliant with longer sequences of inputs. Our code is available at https://github.com/elicassion/StARformer.  ( 2 min )
    The good, the bad and the ugly sides of data augmentation: An implicit spectral regularization perspective. (arXiv:2210.05021v2 [cs.LG] UPDATED)
    Data augmentation (DA) is a powerful workhorse for bolstering performance in modern machine learning. Specific augmentations like translations and scaling in computer vision are traditionally believed to improve generalization by generating new (artificial) data from the same distribution. However, this traditional viewpoint does not explain the success of prevalent augmentations in modern machine learning (e.g. randomized masking, cutout, mixup), that greatly alter the training data distribution. In this work, we develop a new theoretical framework to characterize the impact of a general class of DA on underparameterized and overparameterized linear model generalization. Our framework reveals that DA induces implicit spectral regularization through a combination of two distinct effects: a) manipulating the relative proportion of eigenvalues of the data covariance matrix in a training-data-dependent manner, and b) uniformly boosting the entire spectrum of the data covariance matrix through ridge regression. These effects, when applied to popular augmentations, give rise to a wide variety of phenomena, including discrepancies in generalization between over-parameterized and under-parameterized regimes and differences between regression and classification tasks. Our framework highlights the nuanced and sometimes surprising impacts of DA on generalization, and serves as a testbed for novel augmentation design.  ( 2 min )
    Is Integer Arithmetic Enough for Deep Learning Training?. (arXiv:2207.08822v3 [cs.LG] UPDATED)
    The ever-increasing computational complexity of deep learning models makes their training and deployment difficult on various cloud and edge platforms. Replacing floating-point arithmetic with low-bit integer arithmetic is a promising approach to save energy, memory footprint, and latency of deep learning models. As such, quantization has attracted the attention of researchers in recent years. However, using integer numbers to form a fully functional integer training pipeline including forward pass, back-propagation, and stochastic gradient descent is not studied in detail. Our empirical and mathematical results reveal that integer arithmetic seems to be enough to train deep learning models. Unlike recent proposals, instead of quantization, we directly switch the number representation of computations. Our novel training method forms a fully integer training pipeline that does not change the trajectory of the loss and accuracy compared to floating-point, nor does it need any special hyper-parameter tuning, distribution adjustment, or gradient clipping. Our experimental results show that our proposed method is effective in a wide variety of tasks such as classification (including vision transformers), object detection, and semantic segmentation.  ( 2 min )
    Why Capsule Neural Networks Do Not Scale: Challenging the Dynamic Parse-Tree Assumption. (arXiv:2301.01583v1 [cs.CV])
    Capsule neural networks replace simple, scalar-valued neurons with vector-valued capsules. They are motivated by the pattern recognition system in the human brain, where complex objects are decomposed into a hierarchy of simpler object parts. Such a hierarchy is referred to as a parse-tree. Conceptually, capsule neural networks have been defined to realize such parse-trees. The capsule neural network (CapsNet), by Sabour, Frosst, and Hinton, is the first actual implementation of the conceptual idea of capsule neural networks. CapsNets achieved state-of-the-art performance on simple image recognition tasks with fewer parameters and greater robustness to affine transformations than comparable approaches. This sparked extensive follow-up research. However, despite major efforts, no work was able to scale the CapsNet architecture to more reasonable-sized datasets. Here, we provide a reason for this failure and argue that it is most likely not possible to scale CapsNets beyond toy examples. In particular, we show that the concept of a parse-tree, the main idea behind capsule neuronal networks, is not present in CapsNets. We also show theoretically and experimentally that CapsNets suffer from a vanishing gradient problem that results in the starvation of many capsules during training.  ( 2 min )
    Process, Bias and Temperature Scalable CMOS Analog Computing Circuits for Machine Learning. (arXiv:2205.05664v3 [cs.AR] UPDATED)
    Analog computing is attractive compared to digital computing due to its potential for achieving higher computational density and higher energy efficiency. However, unlike digital circuits, conventional analog computing circuits cannot be easily mapped across different process nodes due to differences in transistor biasing regimes, temperature variations and limited dynamic range. In this work, we generalize the previously reported margin-propagation-based analog computing framework for designing novel \textit{shape-based analog computing} (S-AC) circuits that can be easily cross-mapped across different process nodes. Similar to digital designs S-AC designs can also be scaled for precision, speed, and power. As a proof-of-concept, we show several examples of S-AC circuits implementing mathematical functions that are commonly used in machine learning (ML) architectures. Using circuit simulations we demonstrate that the circuit input/output characteristics remain robust when mapped from a planar CMOS 180nm process to a FinFET 7nm process. Also, using benchmark datasets we demonstrate that the classification accuracy of a S-AC based neural network remains robust when mapped across the two processes and to changes in temperature.  ( 2 min )
    Automating Nearest Neighbor Search Configuration with Constrained Optimization. (arXiv:2301.01702v1 [cs.LG])
    The approximate nearest neighbor (ANN) search problem is fundamental to efficiently serving many real-world machine learning applications. A number of techniques have been developed for ANN search that are efficient, accurate, and scalable. However, such techniques typically have a number of parameters that affect the speed-recall tradeoff, and exhibit poor performance when such parameters aren't properly set. Tuning these parameters has traditionally been a manual process, demanding in-depth knowledge of the underlying search algorithm. This is becoming an increasingly unrealistic demand as ANN search grows in popularity. To tackle this obstacle to ANN adoption, this work proposes a constrained optimization-based approach to tuning quantization-based ANN algorithms. Our technique takes just a desired search cost or recall as input, and then generates tunings that, empirically, are very close to the speed-recall Pareto frontier and give leading performance on standard benchmarks.  ( 2 min )
    GCNet: Graph Completion Network for Incomplete Multimodal Learning in Conversation. (arXiv:2203.02177v2 [cs.LG] UPDATED)
    Conversations have become a critical data format on social media platforms. Understanding conversation from emotion, content and other aspects also attracts increasing attention from researchers due to its widespread application in human-computer interaction. In real-world environments, we often encounter the problem of incomplete modalities, which has become a core issue of conversation understanding. To address this problem, researchers propose various methods. However, existing approaches are mainly designed for individual utterances rather than conversational data, which cannot fully exploit temporal and speaker information in conversations. To this end, we propose a novel framework for incomplete multimodal learning in conversations, called "Graph Complete Network (GCNet)", filling the gap of existing works. Our GCNet contains two well-designed graph neural network-based modules, "Speaker GNN" and "Temporal GNN", to capture temporal and speaker dependencies. To make full use of complete and incomplete data, we jointly optimize classification and reconstruction tasks in an end-to-end manner. To verify the effectiveness of our method, we conduct experiments on three benchmark conversational datasets. Experimental results demonstrate that our GCNet is superior to existing state-of-the-art approaches in incomplete multimodal learning. Code is available at https://github.com/zeroQiaoba/GCNet.  ( 2 min )
    Online Service Migration in Mobile Edge with Incomplete System Information: A Deep Recurrent Actor-Critic Learning Approach. (arXiv:2012.08679v5 [cs.NI] UPDATED)
    Multi-access Edge Computing (MEC) is an emerging computing paradigm that extends cloud computing to the network edge to support resource-intensive applications on mobile devices. As a crucial problem in MEC, service migration needs to decide how to migrate user services for maintaining the Quality-of-Service when users roam between MEC servers with limited coverage and capacity. However, finding an optimal migration policy is intractable due to the dynamic MEC environment and user mobility. Many existing studies make centralized migration decisions based on complete system-level information, which is time-consuming and also lacks desirable scalability. To address these challenges, we propose a novel learning-driven method, which is user-centric and can make effective online migration decisions by utilizing incomplete system-level information. Specifically, the service migration problem is modeled as a Partially Observable Markov Decision Process (POMDP). To solve the POMDP, we design a new encoder network that combines a Long Short-Term Memory (LSTM) and an embedding matrix for effective extraction of hidden information, and further propose a tailored off-policy actor-critic algorithm for efficient training. The extensive experimental results based on real-world mobility traces demonstrate that this new method consistently outperforms both the heuristic and state-of-the-art learning-driven algorithms and can achieve near-optimal results on various MEC scenarios.  ( 2 min )
    Fairness in Graph Mining: A Survey. (arXiv:2204.09888v2 [cs.LG] UPDATED)
    Graph mining algorithms have been playing a significant role in myriad fields over the years. However, despite their promising performance on various graph analytical tasks, most of these algorithms lack fairness considerations. As a consequence, they could lead to discrimination towards certain populations when exploited in human-centered applications. Recently, algorithmic fairness has been extensively studied in graph-based applications. In contrast to algorithmic fairness on independent and identically distributed (i.i.d.) data, fairness in graph mining has exclusive backgrounds, taxonomies, and fulfilling techniques. In this survey, we provide a comprehensive and up-to-date introduction of existing literature under the context of fair graph mining. Specifically, we propose a novel taxonomy of fairness notions on graphs, which sheds light on their connections and differences. We further present an organized summary of existing techniques that promote fairness in graph mining. Finally, we summarize the widely used datasets in this emerging research field and provide insights on current research challenges and open questions, aiming at encouraging cross-breeding ideas and further advances.  ( 2 min )
    Empirical Differential Privacy. (arXiv:1910.12820v5 [cs.LG] UPDATED)
    We show how to achieve differential privacy with no or reduced added noise, based on the empirical noise in the data itself. Unlike previous works on noiseless privacy, the empirical viewpoint avoids making any explicit assumptions about the random process generating the data.  ( 2 min )
    ExPoSe: Combining State-Based Exploration with Gradient-Based Online Search. (arXiv:2202.01461v3 [cs.AI] UPDATED)
    A tree-based online search algorithm iteratively simulates trajectories and updates action-values of a set of states stored in a tree structure. It works reasonably well in practice but fails to take advantage of the information gathered from similar states. Depending upon the smoothness of the action-value function, a simple way to interpolate information among similar states is to perform online learning; policy gradient search provides a practical algorithm to achieve this. However, policy gradient search does not have an explicit exploration mechanism, which is present in tree-based online search algorithms. In this paper, we propose an efficient and effective online search algorithm, named Exploratory Policy Gradient Search (ExPoSe), that leverages information sharing among states by directly updating the search policy parameters while following a well-defined exploration mechanism during the online search. We conduct experiments on several decision-making problems, including Atari games, Sokoban and Hamiltonian cycle search in sparse graphs and show that ExPoSe consistently outperforms popular online search algorithms across all domains.  ( 2 min )
    Social Fraud Detection Review: Methods, Challenges and Analysis. (arXiv:2111.05645v2 [cs.LG] UPDATED)
    Social reviews have dominated the web and become a plausible source of product information. People and businesses use such information for decision-making. Businesses also make use of social information to spread fake information using a single user, groups of users, or a bot trained to generate fraudulent content. Many studies proposed approaches based on user behaviors and review text to address the challenges of fraud detection. To provide an exhaustive literature review, social fraud detection is reviewed using a framework that considers three key components: the review itself, the user who carries out the review, and the item being reviewed. As features are extracted for the component representation, a feature-wise review is provided based on behavioral, text-based features and their combination. With this framework, a comprehensive overview of approaches is presented including supervised, semi-supervised, and unsupervised learning. The supervised approaches for fraud detection are introduced and categorized into two sub-categories; classical, and deep learning. The lack of labeled datasets is explained and potential solutions are suggested. To help new researchers in the area develop a better understanding, a topic analysis and an overview of future directions is provided in each step of the proposed systematic framework.  ( 2 min )
    Graph state-space models. (arXiv:2301.01741v1 [cs.LG])
    State-space models constitute an effective modeling tool to describe multivariate time series and operate by maintaining an updated representation of the system state from which predictions are made. Within this framework, relational inductive biases, e.g., associated with functional dependencies existing among signals, are not explicitly exploited leaving unattended great opportunities for effective modeling approaches. The manuscript aims, for the first time, at filling this gap by matching state-space modeling and spatio-temporal data where the relational information, say the functional graph capturing latent dependencies, is learned directly from data and is allowed to change over time. Within a probabilistic formulation that accounts for the uncertainty in the data-generating process, an encoder-decoder architecture is proposed to learn the state-space model end-to-end on a downstream task. The proposed methodological framework generalizes several state-of-the-art methods and demonstrates to be effective in extracting meaningful relational information while achieving optimal forecasting performance in controlled environments.  ( 2 min )
    Implicit Bias of Gradient Descent for Mean Squared Error Regression with Two-Layer Wide Neural Networks. (arXiv:2006.07356v3 [stat.ML] UPDATED)
    We investigate gradient descent training of wide neural networks and the corresponding implicit bias in function space. For univariate regression, we show that the solution of training a width-$n$ shallow ReLU network is within $n^{- 1/2}$ of the function which fits the training data and whose difference from the initial function has the smallest 2-norm of the second derivative weighted by a curvature penalty that depends on the probability distribution that is used to initialize the network parameters. We compute the curvature penalty function explicitly for various common initialization procedures. For instance, asymmetric initialization with a uniform distribution yields a constant curvature penalty, and thence the solution function is the natural cubic spline interpolation of the training data. \hj{For stochastic gradient descent we obtain the same implicit bias result.} We obtain a similar result for different activation functions. For multivariate regression we show an analogous result, whereby the second derivative is replaced by the Radon transform of a fractional Laplacian. For initialization schemes that yield a constant penalty function, the solutions are polyharmonic splines. Moreover, we show that the training trajectories are captured by trajectories of smoothing splines with decreasing regularization strength.  ( 2 min )
    Holistic Adversarial Robustness of Deep Learning Models. (arXiv:2202.07201v2 [cs.LG] UPDATED)
    Adversarial robustness studies the worst-case performance of a machine learning model to ensure safety and reliability. With the proliferation of deep-learning-based technology, the potential risks associated with model development and deployment can be amplified and become dreadful vulnerabilities. This paper provides a comprehensive overview of research topics and foundational principles of research methods for adversarial robustness of deep learning models, including attacks, defenses, verification, and novel applications.  ( 2 min )
    Bias-Scalable Near-Memory CMOS Analog Processor for Machine Learning. (arXiv:2202.05022v3 [cs.ET] UPDATED)
    Bias-scalable analog computing is attractive for implementing machine learning (ML) processors with distinct power-performance specifications. For instance, ML implementations for server workloads are focused on higher computational throughput for faster training, whereas ML implementations for edge devices are focused on energy-efficient inference. In this paper, we demonstrate the implementation of bias-scalable approximate analog computing circuits using the generalization of the margin-propagation principle called shape-based analog computing (S-AC). The resulting S-AC core integrates several near-memory compute elements, which include: (a) non-linear activation functions; (b) inner-product compute circuits; and (c) a mixed-signal compressive memory, all of which can be scaled for performance or power while preserving its functionality. Using measured results from prototypes fabricated in a 180nm CMOS process, we demonstrate that the performance of computing modules remains robust to transistor biasing and variations in temperature. In this paper, we also demonstrate the effect of bias-scalability and computational accuracy on a simple ML regression task.  ( 2 min )
    Learning to segment fetal brain tissue from noisy annotations. (arXiv:2203.14962v2 [eess.IV] UPDATED)
    Automatic fetal brain tissue segmentation can enhance the quantitative assessment of brain development at this critical stage. Deep learning methods represent the state of the art in medical image segmentation and have also achieved impressive results in brain segmentation. However, effective training of a deep learning model to perform this task requires a large number of training images to represent the rapid development of the transient fetal brain structures. On the other hand, manual multi-label segmentation of a large number of 3D images is prohibitive. To address this challenge, we segmented 272 training images, covering 19-39 gestational weeks, using an automatic multi-atlas segmentation strategy based on deformable registration and probabilistic atlas fusion, and manually corrected large errors in those segmentations. Since this process generated a large training dataset with noisy segmentations, we developed a novel label smoothing procedure and a loss function to train a deep learning model with smoothed noisy segmentations. Our proposed methods properly account for the uncertainty in tissue boundaries. We evaluated our method on 23 manually-segmented test images of a separate set of fetuses. Results show that our method achieves an average Dice similarity coefficient of 0.893 and 0.916 for the transient structures of younger and older fetuses, respectively. Our method generated results that were significantly more accurate than several state-of-the-art methods including nnU-Net that achieved the closest results to our method. Our trained model can serve as a valuable tool to enhance the accuracy and reproducibility of fetal brain analysis in MRI.  ( 2 min )
    Large-width asymptotics for ReLU neural networks with $\alpha$-Stable initializations. (arXiv:2206.08065v3 [cs.LG] UPDATED)
    There is a recent and growing literature on large-width asymptotic properties of Gaussian neural networks (NNs), namely NNs whose weights are initialized as Gaussian distributions. Two popular problems are: i) the study of the large-width distributions of NNs, which characterizes the infinitely wide limit of a rescaled NN in terms of a Gaussian stochastic process; ii) the study of the large-width training dynamics of NNs, which characterizes the infinitely wide dynamics in terms of a deterministic kernel, referred to as the neural tangent kernel (NTK), and shows that, for a sufficiently large width, the gradient descent achieves zero training error at a linear rate. In this paper, we consider these problems for $\alpha$-Stable NNs, namely NNs whose weights are initialized as $\alpha$-Stable distributions with $\alpha\in(0,2]$. First, for $\alpha$-Stable NNs with a ReLU activation function, we show that if the NN's width goes to infinity then a rescaled NN converges weakly to an $\alpha$-Stable stochastic process. As a difference with respect to the Gaussian setting, our result shows that the choice of the activation function affects the scaling of the NN, that is: to achieve the infinitely wide $\alpha$-Stable process, the ReLU activation requires an additional logarithmic term in the scaling with respect to sub-linear activations. Then, we study the large-width training dynamics of $\alpha$-Stable ReLU-NNs, characterizing the infinitely wide dynamics in terms of a random kernel, referred to as the $\alpha$-Stable NTK, and showing that, for a sufficiently large width, the gradient descent achieves zero training error at a linear rate. The randomness of the $\alpha$-Stable NTK is a further difference with respect to the Gaussian setting, that is: within the $\alpha$-Stable setting, the randomness of the NN at initialization does not vanish in the large-width regime of the training.  ( 3 min )
    Neural Message Passing for Objective-Based Uncertainty Quantification and Optimal Experimental Design. (arXiv:2203.07120v2 [cs.LG] UPDATED)
    Various real-world scientific applications involve the mathematical modeling of complex uncertain systems with numerous unknown parameters. Accurate parameter estimation is often practically infeasible in such systems, as the available training data may be insufficient and the cost of acquiring additional data may be high. In such cases, it is desirable to represent the model uncertainty in a Bayesian paradigm, based on which we can design robust operators retaining the best overall performance across all possible models and design optimal experiments that can effectively reduce uncertainty to enhance the performance of such operators maximally. While objective-based uncertainty quantification (objective-UQ) based on MOCU (mean objective cost of uncertainty) has been an effective means for quantifying uncertainty in complex systems, a major drawback has been the high computational cost of estimating MOCU. In this work, we propose a novel scheme to reduce the computational cost for objective-UQ via MOCU based on a data-driven approach. We adopt a neural message-passing model for surrogate modeling, incorporating a novel axiomatic constraint loss that penalizes an increase in the estimated system uncertainty. As an illustrative example, we consider the optimal experimental design (OED) problem for uncertain Kuramoto models, where the goal is to predict the experiments that can most effectively enhance robust synchronization performance through uncertainty reduction. We show that our proposed approach can accelerate MOCU-based OED by four to five orders of magnitude, virtually without any visible performance loss compared to the state-of-the-art. The proposed approach is applicable to general OED tasks, beyond the Kuramoto model.  ( 2 min )
    Semi-Supervised Subspace Clustering via Tensor Low-Rank Representation. (arXiv:2205.10481v2 [cs.CV] UPDATED)
    In this letter, we propose a novel semi-supervised subspace clustering method, which is able to simultaneously augment the initial supervisory information and construct a discriminative affinity matrix. By representing the limited amount of supervisory information as a pairwise constraint matrix, we observe that the ideal affinity matrix for clustering shares the same low-rank structure as the ideal pairwise constraint matrix. Thus, we stack the two matrices into a 3-D tensor, where a global low-rank constraint is imposed to promote the affinity matrix construction and augment the initial pairwise constraints synchronously. Besides, we use the local geometry structure of input samples to complement the global low-rank prior to achieve better affinity matrix learning. The proposed model is formulated as a Laplacian graph regularized convex low-rank tensor representation problem, which is further solved with an alternative iterative algorithm. In addition, we propose to refine the affinity matrix with the augmented pairwise constraints. Comprehensive experimental results on eight commonly-used benchmark datasets demonstrate the superiority of our method over state-of-the-art methods. The code is publicly available at https://github.com/GuanxingLu/Subspace-Clustering.  ( 2 min )
    GUAP: Graph Universal Attack Through Adversarial Patching. (arXiv:2301.01731v1 [cs.LG])
    Graph neural networks (GNNs) are a class of effective deep learning models for node classification tasks; yet their predictive capability may be severely compromised under adversarially designed unnoticeable perturbations to the graph structure and/or node data. Most of the current work on graph adversarial attacks aims at lowering the overall prediction accuracy, but we argue that the resulting abnormal model performance may catch attention easily and invite quick counterattack. Moreover, attacks through modification of existing graph data may be hard to conduct if good security protocols are implemented. In this work, we consider an easier attack harder to be noticed, through adversarially patching the graph with new nodes and edges. The attack is universal: it targets a single node each time and flips its connection to the same set of patch nodes. The attack is unnoticeable: it does not modify the predictions of nodes other than the target. We develop an algorithm, named GUAP, that achieves high attack success rate but meanwhile preserves the prediction accuracy. GUAP is fast to train by employing a sampling strategy. We demonstrate that a 5% sampling in each epoch yields 20x speedup in training, with only a slight degradation in attack performance. Additionally, we show that the adversarial patch trained with the graph convolutional network transfers well to other GNNs, such as the graph attention network.  ( 2 min )
    Task-Oriented Data Compression for Multi-Agent Communications Over Bit-Budgeted Channels. (arXiv:2005.14220v5 [cs.IT] UPDATED)
    Various applications for inter-machine communications are on the rise. Whether it is for autonomous driving vehicles or the internet of everything, machines are more connected than ever to improve their performance in fulfilling a given task. While in traditional communications the goal has often been to reconstruct the underlying message, under the emerging task-oriented paradigm, the goal of communication is to enable the receiving end to make more informed decisions or more precise estimates/computations. Motivated by these recent developments, in this paper, we perform an indirect design of the communications in a multi-agent system (MAS) in which agents cooperate to maximize the averaged sum of discounted one-stage rewards of a collaborative task. Due to the bit-budgeted communications between the agents, each agent should efficiently represent its local observation and communicate an abstracted version of the observations to improve the collaborative task performance. We first show that this problem can be approximated as a form of data-quantization problem which we call task-oriented data compression (TODC). We then introduce the state-aggregation for information compression algorithm (SAIC) to solve the formulated TODC problem. It is shown that SAIC is able to achieve near-optimal performance in terms of the achieved sum of discounted rewards. The proposed algorithm is applied to a geometric consensus problem and its performance is compared with several benchmarks. Numerical experiments confirm the promise of this indirect design approach for task-oriented multi-agent communications.  ( 3 min )
    Constrained regret minimization for multi-criterion multi-armed bandits. (arXiv:2006.09649v2 [cs.LG] UPDATED)
    We consider a stochastic multi-armed bandit setting and study the problem of constrained regret minimization over a given time horizon. Each arm is associated with an unknown, possibly multi-dimensional distribution, and the merit of an arm is determined by several, possibly conflicting attributes. The aim is to optimize a 'primary' attribute subject to user-provided constraints on other 'secondary' attributes. We assume that the attributes can be estimated using samples from the arms' distributions, and that the estimators enjoy suitable concentration properties. We propose an algorithm called Con-LCB that guarantees a logarithmic regret, i.e., the average number of plays of all non-optimal arms is at most logarithmic in the horizon. The algorithm also outputs a Boolean flag that correctly identifies, with high probability, whether the given instance is feasible/infeasible with respect to the constraints. We also show that Con-LCB is optimal within a universal constant, i.e., that more sophisticated algorithms cannot do much better universally. Finally, we establish a fundamental trade-off between regret minimization and feasibility identification. Our framework finds natural applications, for instance, in financial portfolio optimization, where risk constrained maximization of expected return is meaningful.  ( 2 min )
    A Dual-Purpose Deep Learning Model for Auscultated Lung and Tracheal Sound Analysis Based on Mixed Set Training. (arXiv:2107.04229v2 [cs.SD] UPDATED)
    Many deep learning-based computerized respiratory sound analysis methods have previously been developed. However, these studies focus on either lung sound only or tracheal sound only. The effectiveness of using a lung sound analysis algorithm on tracheal sound and vice versa has never been investigated. Furthermore, no one knows whether using lung and tracheal sounds together in training a respiratory sound analysis model is beneficial. In this study, we first constructed a tracheal sound database, HF_Tracheal_V1, containing 10448 15-s tracheal sound recordings, 21741 inhalation labels, 15858 exhalation labels, and 6414 continuous adventitious sound (CAS) labels. HF_Tracheal_V1 and our previously built lung sound database, HF_Lung_V2, were either combined (mixed set), used one after the other (domain adaptation), or used alone to train convolutional neural network bidirectional gate recurrent unit models for inhalation, exhalation, and CAS detection in lung and tracheal sounds. The results revealed that the models trained using lung sound alone performed poorly in tracheal sound analysis and vice versa. However, mixed set training or domain adaptation improved the performance for 1) inhalation and exhalation detection in lung sounds and 2) inhalation, exhalation, and CAS detection in tracheal sounds compared to positive controls (the models trained using lung sound alone and used in lung sound analysis and vice versa). In particular, the model trained on the mixed set had great flexibility to serve two purposes, lung and tracheal sound analyses, at the same time.  ( 3 min )
    Best Arm Identification with Contextual Information under a Small Gap. (arXiv:2209.07330v4 [cs.LG] UPDATED)
    We study the best-arm identification (BAI) problem with a fixed budget and contextual (covariate) information. In each round of an adaptive experiment, after observing contextual information, we choose a treatment arm using past observations and current context. Our goal is to identify the best treatment arm, which is a treatment arm with the maximal expected reward marginalized over the contextual distribution, with a minimal probability of misidentification. In this study, we consider a class of nonparametric bandit models that converge to location-shift models when the gaps go to zero. First, we derive lower bounds of the misidentification probability for a certain class of strategies and bandit models (probabilistic models of potential outcomes) under a small-gap regime. A small-gap regime is a situation where gaps of the expected rewards between the best and suboptimal treatment arms go to zero, which corresponds to one of the worst cases in identifying the best treatment arm. We then develop the ``Random Sampling (RS)-Augmented Inverse Probability weighting (AIPW) strategy,'' which is asymptotically optimal in the sense that the probability of misidentification under the strategy matches the lower bound when the budget goes to infinity in the small-gap regime. The RS-AIPW strategy consists of the RS rule tracking a target sample allocation ratio and the recommendation rule using the AIPW estimator.  ( 2 min )
    Pattern Recognition Experiments on Mathematical Expressions. (arXiv:2301.01624v1 [math.NT])
    We provide the results of pattern recognition experiments on mathematical expressions. We give a few examples of conjectured results. None of which was thoroughly checked for novelty. We did not attempt to prove all the relations found and focused on their generation.  ( 2 min )
    Cost-Sensitive Stacking: an Empirical Evaluation. (arXiv:2301.01748v1 [cs.LG])
    Many real-world classification problems are cost-sensitive in nature, such that the misclassification costs vary between data instances. Cost-sensitive learning adapts classification algorithms to account for differences in misclassification costs. Stacking is an ensemble method that uses predictions from several classifiers as the training data for another classifier, which in turn makes the final classification decision. While a large body of empirical work exists where stacking is applied in various domains, very few of these works take the misclassification costs into account. In fact, there is no consensus in the literature as to what cost-sensitive stacking is. In this paper we perform extensive experiments with the aim of establishing what the appropriate setup for a cost-sensitive stacking ensemble is. Our experiments, conducted on twelve datasets from a number of application domains, using real, instance-dependent misclassification costs, show that for best performance, both levels of stacking require cost-sensitive classification decision.  ( 2 min )
    Augmenting data-driven models for energy systems through feature engineering: A Python framework for feature engineering. (arXiv:2301.01720v1 [cs.LG])
    Data-driven modeling is an approach in energy systems modeling that has been gaining popularity. In data-driven modeling, machine learning methods such as linear regression, neural networks or decision-tree based methods are being applied. While these methods do not require domain knowledge, they are sensitive to data quality. Therefore, improving data quality in a dataset is beneficial for creating machine learning-based models. The improvement of data quality can be implemented through preprocessing methods. A selected type of preprocessing is feature engineering, which focuses on evaluating and improving the quality of certain features inside the dataset. Feature engineering methods include methods such as feature creation, feature expansion, or feature selection. In this work, a Python framework containing different feature engineering methods is presented. This framework contains different methods for feature creation, expansion and selection; in addition, methods for transforming or filtering data are implemented. The implementation of the framework is based on the Python library scikit-learn. The framework is demonstrated on a case study of a use case from energy demand prediction. A data-driven model is created including selected feature engineering methods. The results show an improvement in prediction accuracy through the engineered features.  ( 2 min )
    Learning Gaussian Mixtures Using the Wasserstein-Fisher-Rao Gradient Flow. (arXiv:2301.01766v1 [math.ST])
    Gaussian mixture models form a flexible and expressive parametric family of distributions that has found applications in a wide variety of applications. Unfortunately, fitting these models to data is a notoriously hard problem from a computational perspective. Currently, only moment-based methods enjoy theoretical guarantees while likelihood-based methods are dominated by heuristics such as Expectation-Maximization that are known to fail in simple examples. In this work, we propose a new algorithm to compute the nonparametric maximum likelihood estimator (NPMLE) in a Gaussian mixture model. Our method is based on gradient descent over the space of probability measures equipped with the Wasserstein-Fisher-Rao geometry for which we establish convergence guarantees. In practice, it can be approximated using an interacting particle system where the weight and location of particles are updated alternately. We conduct extensive numerical experiments to confirm the effectiveness of the proposed algorithm compared not only to classical benchmarks but also to similar gradient descent algorithms with respect to simpler geometries. In particular, these simulations illustrate the benefit of updating both weight and location of the interacting particles.  ( 2 min )
    Episodes Discovery Recommendation with Multi-Source Augmentations. (arXiv:2301.01737v1 [cs.IR])
    Recommender systems (RS) commonly retrieve potential candidate items for users from a massive number of items by modeling user interests based on historical interactions. However, historical interaction data is highly sparse, and most items are long-tail items, which limits the representation learning for item discovery. This problem is further augmented by the discovery of novel or cold-start items. For example, after a user displays interest in bitcoin financial investment shows in the podcast space, a recommender system may want to suggest, e.g., a newly released blockchain episode from a more technical show. Episode correlations help the discovery, especially when interaction data of episodes is limited. Accordingly, we build upon the classical Two-Tower model and introduce the novel Multi-Source Augmentations using a Contrastive Learning framework (MSACL) to enhance episode embedding learning by incorporating positive episodes from numerous correlated semantics. Extensive experiments on a real-world podcast recommendation dataset from a large audio streaming platform demonstrate the effectiveness of the proposed framework for user podcast exploration and cold-start episode recommendation.  ( 2 min )
    Matrices with Gaussian noise: optimal estimates for singular subspace perturbation. (arXiv:1803.00679v2 [stat.ML] UPDATED)
    The Davis--Kahan--Wedin $\sin \Theta$ theorem describes how the singular subspaces of a matrix change when subjected to a small perturbation. This classic result is sharp in the worst case scenario. In this paper, we prove a stochastic version of the Davis--Kahan--Wedin $\sin \Theta$ theorem when the perturbation is a Gaussian random matrix. Under certain structural assumptions, we obtain an optimal bound that significantly improves upon the classic Davis--Kahan--Wedin $\sin \Theta$ theorem. One of our key tools is a new perturbation bound for the singular values, which may be of independent interest.  ( 2 min )
    Text sampling strategies for predicting missing bibliographic links. (arXiv:2301.01673v1 [cs.LG])
    The paper proposes various strategies for sampling text data when performing automatic sentence classification for the purpose of detecting missing bibliographic links. We construct samples based on sentences as semantic units of the text and add their immediate context which consists of several neighboring sentences. We examine a number of sampling strategies that differ in context size and position. The experiment is carried out on the collection of STEM scientific papers. Including the context of sentences into samples improves the result of their classification. We automatically determine the optimal sampling strategy for a given text collection by implementing an ensemble voting when classifying the same data sampled in different ways. Sampling strategy taking into account the sentence context with hard voting procedure leads to the classification accuracy of 98% (F1-score). This method of detecting missing bibliographic links can be used in recommendation engines of applied intelligent information systems.  ( 2 min )
    Learning-based MPC from Big Data Using Reinforcement Learning. (arXiv:2301.01667v1 [eess.SY])
    This paper presents an approach for learning Model Predictive Control (MPC) schemes directly from data using Reinforcement Learning (RL) methods. The state-of-the-art learning methods use RL to improve the performance of parameterized MPC schemes. However, these learning algorithms are often gradient-based methods that require frequent evaluations of computationally expensive MPC schemes, thereby restricting their use on big datasets. We propose to tackle this issue by using tools from RL to learn a parameterized MPC scheme directly from data in an offline fashion. Our approach derives an MPC scheme without having to solve it over the collected dataset, thereby eliminating the computational complexity of existing techniques for big data. We evaluate the proposed method on three simulated experiments of varying complexity.  ( 2 min )
    A General Framework for Learning Mean-Field Games. (arXiv:2003.06069v3 [cs.LG] UPDATED)
    This paper presents a general mean-field game (GMFG) framework for simultaneous learning and decision-making in stochastic games with a large population. It first establishes the existence of a unique Nash Equilibrium to this GMFG, and demonstrates that naively combining reinforcement learning with the fixed-point approach in classical MFGs yields unstable algorithms. It then proposes value-based and policy-based reinforcement learning algorithms (GMF-V and GMF-P, respectively) with smoothed policies, with analysis of their convergence properties and computational complexities. Experiments on an equilibrium product pricing problem demonstrate that GMF-V-Q and GMF-P-TRPO, two specific instantiations of GMF-V and GMF-P, respectively, with Q-learning and TRPO, are both efficient and robust in the GMFG setting. Moreover, their performance is superior in convergence speed, accuracy, and stability when compared with existing algorithms for multi-agent reinforcement learning in the $N$-player setting.  ( 2 min )
    Learning Decorrelated Representations Efficiently Using Fast Fourier Transform. (arXiv:2301.01569v1 [cs.LG])
    Barlow Twins and VICReg are self-supervised representation learning models that use regularizers to decorrelate features. Although they work as well as conventional representation learning models, their training can be computationally demanding if the dimension of projected representations is high; as these regularizers are defined in terms of individual elements of a cross-correlation or covariance matrix, computing the loss for $d$-dimensional projected representations of $n$ samples takes $O(n d^2)$ time. In this paper, we propose a relaxed version of decorrelating regularizers that can be computed in $O(n d\log d)$ time by the fast Fourier transform. We also propose an inexpensive trick to mitigate the undesirable local minima that develop with the relaxation. Models learning representations using the proposed regularizers show comparable accuracy to existing models in downstream tasks, whereas the training requires less memory and is faster when $d$ is large.  ( 2 min )
    Learning Ambiguity from Crowd Sequential Annotations. (arXiv:2301.01579v1 [cs.CL])
    Most crowdsourcing learning methods treat disagreement between annotators as noisy labelings while inter-disagreement among experts is often a good indicator for the ambiguity and uncertainty that is inherent in natural language. In this paper, we propose a framework called Learning Ambiguity from Crowd Sequential Annotations (LA-SCA) to explore the inter-disagreement between reliable annotators and effectively preserve confusing label information. First, a hierarchical Bayesian model is developed to infer ground-truth from crowds and group the annotators with similar reliability together. By modeling the relationship between the size of group the annotator involved in, the annotator's reliability and element's unambiguity in each sequence, inter-disagreement between reliable annotators on ambiguous elements is computed to obtain label confusing information that is incorporated to cost-sensitive sequence labeling. Experimental results on POS tagging and NER tasks show that our proposed framework achieves competitive performance in inferring ground-truth from crowds and predicting unknown sequences, and interpreting hierarchical clustering results helps discover labeling patterns of annotators with similar reliability.  ( 2 min )
    Demystify Problem-Dependent Power of Quantum Neural Networks on Multi-Class Classification. (arXiv:2301.01597v1 [quant-ph])
    Quantum neural networks (QNNs) have become an important tool for understanding the physical world, but their advantages and limitations are not fully understood. Some QNNs with specific encoding methods can be efficiently simulated by classical surrogates, while others with quantum memory may perform better than classical classifiers. Here we systematically investigate the problem-dependent power of quantum neural classifiers (QCs) on multi-class classification tasks. Through the analysis of expected risk, a measure that weighs the training loss and the generalization error of a classifier jointly, we identify two key findings: first, the training loss dominates the power rather than the generalization ability; second, QCs undergo a U-shaped risk curve, in contrast to the double-descent risk curve of deep neural classifiers. We also reveal the intrinsic connection between optimal QCs and the Helstrom bound and the equiangular tight frame. Using these findings, we propose a method that uses loss dynamics to probe whether a QC may be more effective than a classical classifier on a particular learning task. Numerical results demonstrate the effectiveness of our approach to explain the superiority of QCs over multilayer Perceptron on parity datasets and their limitations over convolutional neural networks on image datasets. Our work sheds light on the problem-dependent power of QNNs and offers a practical tool for evaluating their potential merit.  ( 2 min )
    COVID-Net USPro: An Open-Source Explainable Few-Shot Deep Prototypical Network to Monitor and Detect COVID-19 Infection from Point-of-Care Ultrasound Images. (arXiv:2301.01679v1 [eess.IV])
    As the Coronavirus Disease 2019 (COVID-19) continues to impact many aspects of life and the global healthcare systems, the adoption of rapid and effective screening methods to prevent further spread of the virus and lessen the burden on healthcare providers is a necessity. As a cheap and widely accessible medical image modality, point-of-care ultrasound (POCUS) imaging allows radiologists to identify symptoms and assess severity through visual inspection of the chest ultrasound images. Combined with the recent advancements in computer science, applications of deep learning techniques in medical image analysis have shown promising results, demonstrating that artificial intelligence-based solutions can accelerate the diagnosis of COVID-19 and lower the burden on healthcare professionals. However, the lack of a huge amount of well-annotated data poses a challenge in building effective deep neural networks in the case of novel diseases and pandemics. Motivated by this, we present COVID-Net USPro, an explainable few-shot deep prototypical network, that monitors and detects COVID-19 positive cases with high precision and recall from minimal ultrasound images. COVID-Net USPro achieves 99.65% overall accuracy, 99.7% recall and 99.67% precision for COVID-19 positive cases when trained with only 5 shots. The analytic pipeline and results were verified by our contributing clinician with extensive experience in POCUS interpretation, ensuring that the network makes decisions based on actual patterns.  ( 2 min )
    Power Spectral Density-Based Resting-State EEG Classification of First-Episode Psychosis. (arXiv:2301.01588v1 [q-bio.NC])
    Historically, the analysis of stimulus-dependent time-frequency patterns has been the cornerstone of most electroencephalography (EEG) studies. The abnormal oscillations in high-frequency waves associated with psychotic disorders during sensory and cognitive tasks have been studied many times. However, any significant dissimilarity in the resting-state low-frequency bands is yet to be established. Spectral analysis of the alpha and delta band waves shows the effectiveness of stimulus-independent EEG in identifying the abnormal activity patterns of pathological brains. A generalized model incorporating multiple frequency bands should be more efficient in associating potential EEG biomarkers with First-Episode Psychosis (FEP), leading to an accurate diagnosis. We explore multiple machine-learning methods, including random-forest, support vector machine, and Gaussian Process Classifier (GPC), to demonstrate the practicality of resting-state Power Spectral Density (PSD) to distinguish patients of FEP from healthy controls. A comprehensive discussion of our preprocessing methods for PSD analysis and a detailed comparison of different models are included in this paper. The GPC model outperforms the other models with a specificity of 95.78% to show that PSD can be used as an effective feature extraction technique for analyzing and classifying resting-state EEG signals of psychiatric disorders.  ( 2 min )
    Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries. (arXiv:2301.01701v1 [cs.CR])
    Reverse engineering binaries is required to understand and analyse programs for which the source code is unavailable. Decompilers can transform the largely unreadable binaries into a more readable source code-like representation. However, reverse engineering is time-consuming, much of which is taken up by labelling the functions with semantic information. While the automated summarisation of decompiled code can help Reverse Engineers understand and analyse binaries, current work mainly focuses on summarising source code, and no suitable dataset exists for this task. In this work, we extend large pre-trained language models of source code to summarise decompiled binary functions. Furthermore, we investigate the impact of input and data properties on the performance of such models. Our approach consists of two main components; the data and the model. We first build CAPYBARA, a dataset of 214K decompiled function-documentation pairs across various compiler optimisations. We extend CAPYBARA further by generating synthetic datasets and deduplicating the data. Next, we fine-tune the CodeT5 base model with CAPYBARA to create BinT5. BinT5 achieves the state-of-the-art BLEU-4 score of 60.83, 58.82, and 44.21 for summarising source, decompiled, and synthetically stripped decompiled code, respectively. This indicates that these models can be extended to decompiled binaries successfully. Finally, we found that the performance of BinT5 is not heavily dependent on the dataset size and compiler optimisation level. We recommend future research to further investigate transferring knowledge when working with less expressive input formats such as stripped binaries.  ( 2 min )
    On the Convergence of Stochastic Gradient Descent in Low-precision Number Formats. (arXiv:2301.01651v1 [cs.LG])
    Deep learning models are dominating almost all artificial intelligence tasks such as vision, text, and speech processing. Stochastic Gradient Descent (SGD) is the main tool for training such models, where the computations are usually performed in single-precision floating-point number format. The convergence of single-precision SGD is normally aligned with the theoretical results of real numbers since they exhibit negligible error. However, the numerical error increases when the computations are performed in low-precision number formats. This provides compelling reasons to study the SGD convergence adapted for low-precision computations. We present both deterministic and stochastic analysis of the SGD algorithm, obtaining bounds that show the effect of number format. Such bounds can provide guidelines as to how SGD convergence is affected when constraints render the possibility of performing high-precision computations remote.  ( 2 min )
    On Fairness of Medical Image Classification with Multiple Sensitive Attributes via Learning Orthogonal Representations. (arXiv:2301.01481v1 [cs.CV])
    Mitigating the discrimination of machine learning models has gained increasing attention in medical image analysis. However, rare works focus on fair treatments for patients with multiple sensitive demographic ones, which is a crucial yet challenging problem for real-world clinical applications. In this paper, we propose a novel method for fair representation learning with respect to multi-sensitive attributes. We pursue the independence between target and multi-sensitive representations by achieving orthogonality in the representation space. Concretely, we enforce the column space orthogonality by keeping target information on the complement of a low-rank sensitive space. Furthermore, in the row space, we encourage feature dimensions between target and sensitive representations to be orthogonal. The effectiveness of the proposed method is demonstrated with extensive experiments on the CheXpert dataset. To our best knowledge, this is the first work to mitigate unfairness with respect to multiple sensitive attributes in the field of medical imaging.  ( 2 min )
    A Survey on Deep Industrial Transfer Learning in Fault Prognostics. (arXiv:2301.01705v1 [cs.LG])
    Due to its probabilistic nature, fault prognostics is a prime example of a use case for deep learning utilizing big data. However, the low availability of such data sets combined with the high effort of fitting, parameterizing and evaluating complex learning algorithms to the heterogenous and dynamic settings typical for industrial applications oftentimes prevents the practical application of this approach. Automatic adaptation to new or dynamically changing fault prognostics scenarios can be achieved using transfer learning or continual learning methods. In this paper, a first survey of such approaches is carried out, aiming at establishing best practices for future research in this field. It is shown that the field is lacking common benchmarks to robustly compare results and facilitate scientific progress. Therefore, the data sets utilized in these publications are surveyed as well in order to identify suitable candidates for such benchmark scenarios.  ( 2 min )
    Data-Driven Model Identification via Hyperparameter Optimization for the Autonomous Racing System. (arXiv:2301.01470v1 [cs.RO])
    In this letter, we propose a model identification method via hyperparameter optimization (MIHO). Our method is able to identify the parameters of the parametric models in a data-driven manner. We utilize MIHO for the dynamics parameters of the AV-21, the full-scaled autonomous race vehicle, and integrate them into our model-based planning and control systems. In experiments, the models with the optimized parameters demonstrate the generalization ability of the vehicle dynamics model. We further conduct extensive field tests to validate our model-based system. The tests show that our race systems leverage the learned model dynamics well and successfully perform obstacle avoidance and high-speed driving over $200 km/h$ at the Indianapolis Motor Speedway and Las Vegas Motor Speedway. The source code for MIHO and videos of the tests are available at https://github.com/hynkis/MIHO.  ( 2 min )
    CI-GNN: A Granger Causality-Inspired Graph Neural Network for Interpretable Brain Network-Based Psychiatric Diagnosis. (arXiv:2301.01642v1 [stat.ML])
    There is a recent trend to leverage the power of graph neural networks (GNNs) for brain-network based psychiatric diagnosis, which,in turn, also motivates an urgent need for psychiatrists to fully understand the decision behavior of the used GNNs. However, most of the existing GNN explainers are either post-hoc in which another interpretive model needs to be created to explain a well-trained GNN, or do not consider the causal relationship between the extracted explanation and the decision, such that the explanation itself contains spurious correlations and suffers from weak faithfulness. In this work, we propose a granger causality-inspired graph neural network (CI-GNN), a built-in interpretable model that is able to identify the most influential subgraph (i.e., functional connectivity within brain regions) that is causally related to the decision (e.g., major depressive disorder patients or healthy controls), without the training of an auxillary interpretive network. CI-GNN learns disentangled subgraph-level representations {\alpha} and \b{eta} that encode, respectively, the causal and noncausal aspects of original graph under a graph variational autoencoder framework, regularized by a conditional mutual information (CMI) constraint. We theoretically justify the validity of the CMI regulation in capturing the causal relationship. We also empirically evaluate the performance of CI-GNN against three baseline GNNs and four state-of-the-art GNN explainers on synthetic data and two large-scale brain disease datasets. We observe that CI-GNN achieves the best performance in a wide range of metrics and provides more reliable and concise explanations which have clinical evidence.  ( 2 min )
    Hospital transfer risk prediction for COVID-19 patients from a medicalized hotel based on Diffusion GraphSAGE. (arXiv:2301.01596v1 [cs.LG])
    The global COVID-19 pandemic has caused more than six million deaths worldwide. Medicalized hotels were established in Taiwan as quarantine facilities for COVID-19 patients with no or mild symptoms. Due to limited medical care available at these hotels, it is of paramount importance to identify patients at risk of clinical deterioration. This study aimed to develop and evaluate a graph-based deep learning approach for progressive hospital transfer risk prediction in a medicalized hotel setting. Vital sign measurements were obtained for 632 patients and daily patient similarity graphs were constructed. Inductive graph convolutional network models were trained on top of the temporally integrated graphs to predict hospital transfer risk. The proposed models achieved AUC scores above 0.83 for hospital transfer risk prediction based on the measurements of past 1, 2, and 3 days, outperforming baseline machine learning methods. A post-hoc analysis on the constructed diffusion-based graph using Local Clustering Coefficient discovered a high-risk cluster with significantly older mean age, higher body temperature, lower SpO2, and shorter length of stay. Further time-to-hospital-transfer survival analysis also revealed a significant decrease in survival probability in the discovered high-risk cluster. The obtained results demonstrated promising predictability and interpretability of the proposed graph-based approach. This technique may help preemptively detect high-risk patients at community-based medical facilities similar to a medicalized hotel.  ( 2 min )
    Multi-Task Learning with Prior Information. (arXiv:2301.01572v1 [cs.LG])
    Multi-task learning aims to boost the generalization performance of multiple related tasks simultaneously by leveraging information contained in those tasks. In this paper, we propose a multi-task learning framework, where we utilize prior knowledge about the relations between features. We also impose a penalty on the coefficients changing for each specific feature to ensure related tasks have similar coefficients on common features shared among them. In addition, we capture a common set of features via group sparsity. The objective is formulated as a non-smooth convex optimization problem, which can be solved with various methods, including gradient descent method with fixed stepsize, iterative shrinkage-thresholding algorithm (ISTA) with back-tracking, and its variation -- fast iterative shrinkage-thresholding algorithm (FISTA). In light of the sub-linear convergence rate of the methods aforementioned, we propose an asymptotically linear convergent algorithm with theoretical guarantee. Empirical experiments on both regression and classification tasks with real-world datasets demonstrate that our proposed algorithms are capable of improving the generalization performance of multiple related tasks.  ( 2 min )
    Multi-View MOOC Quality Evaluation via Information-Aware Graph Representation Learning. (arXiv:2301.01593v1 [cs.CY])
    In this paper, we study the problem of MOOC quality evaluation which is essential for improving the course materials, promoting students' learning efficiency, and benefiting user services. While achieving promising performances, current works still suffer from the complicated interactions and relationships of entities in MOOC platforms. To tackle the challenges, we formulate the problem as a course representation learning task-based and develop an Information-aware Graph Representation Learning(IaGRL) for multi-view MOOC quality evaluation. Specifically, We first build a MOOC Heterogeneous Network (HIN) to represent the interactions and relationships among entities in MOOC platforms. And then we decompose the MOOC HIN into multiple single-relation graphs based on meta-paths to depict the multi-view semantics of courses. The course representation learning can be further converted to a multi-view graph representation task. Different from traditional graph representation learning, the learned course representations are expected to match the following three types of validity: (1) the agreement on expressiveness between the raw course portfolio and the learned course representations; (2) the consistency between the representations in each view and the unified representations; (3) the alignment between the course and MOOC platform representations. Therefore, we propose to exploit mutual information for preserving the validity of course representations. We conduct extensive experiments over real-world MOOC datasets to demonstrate the effectiveness of our proposed method.  ( 2 min )
    Multi-Aspect Explainable Inductive Relation Prediction by Sentence Transformer. (arXiv:2301.01664v1 [cs.CL])
    Recent studies on knowledge graphs (KGs) show that path-based methods empowered by pre-trained language models perform well in the provision of inductive and explainable relation predictions. In this paper, we introduce the concepts of relation path coverage and relation path confidence to filter out unreliable paths prior to model training to elevate the model performance. Moreover, we propose Knowledge Reasoning Sentence Transformer (KRST) to predict inductive relations in KGs. KRST is designed to encode the extracted reliable paths in KGs, allowing us to properly cluster paths and provide multi-aspect explanations. We conduct extensive experiments on three real-world datasets. The experimental results show that compared to SOTA models, KRST achieves the best performance in most transductive and inductive test cases (4 of 6), and in 11 of 12 few-shot test cases.  ( 2 min )
    UAV aided Metaverse over Wireless Communications: A Reinforcement Learning Approach. (arXiv:2301.01474v1 [eess.SY])
    Metaverse is expected to create a virtual world closely connected with reality to provide users with immersive experience with the support of 5G high data rate communication technique. A huge amount of data in physical world needs to be synchronized to the virtual world to provide immersive experience for users, and there will be higher requirements on coverage to include more users into Metaverse. However, 5G signal suffers severe attenuation, which makes it more expensive to maintain the same coverage. Unmanned aerial vehicle (UAV) is a promising candidate technique for future implementation of Metaverse as a low-cost and high-mobility platform for communication devices. In this paper, we propose a proximal policy optimization (PPO) based double-agent cooperative reinforcement learning method for channel allocation and trajectory control of UAV to collect and synchronize data from the physical world to the virtual world, and expand the coverage of Metaverse services economically. Simulation results show that our proposed method is able to achieve better performance compared to the benchmark approaches.  ( 2 min )
    Federated Learning for Data Streams. (arXiv:2301.01542v1 [cs.LG])
    Federated learning (FL) is an effective solution to train machine learning models on the increasing amount of data generated by IoT devices and smartphones while keeping such data localized. Most previous work on federated learning assumes that clients operate on static datasets collected before training starts. This approach may be inefficient because 1) it ignores new samples clients collect during training, and 2) it may require a potentially long preparatory phase for clients to collect enough data. Moreover, learning on static datasets may be simply impossible in scenarios with small aggregate storage across devices. It is, therefore, necessary to design federated algorithms able to learn from data streams. In this work, we formulate and study the problem of \emph{federated learning for data streams}. We propose a general FL algorithm to learn from data streams through an opportune weighted empirical risk minimization. Our theoretical analysis provides insights to configure such an algorithm, and we evaluate its performance on a wide range of machine learning tasks.  ( 2 min )
    Counterfactual Explanations for Land Cover Mapping in a Multi-class Setting. (arXiv:2301.01520v1 [cs.LG])
    Counterfactual explanations are an emerging tool to enhance interpretability of deep learning models. Given a sample, these methods seek to find and display to the user similar samples across the decision boundary. In this paper, we propose a generative adversarial counterfactual approach for satellite image time series in a multi-class setting for the land cover classification task. One of the distinctive features of the proposed approach is the lack of prior assumption on the targeted class for a given counterfactual explanation. This inherent flexibility allows for the discovery of interesting information on the relationship between land cover classes. The other feature consists of encouraging the counterfactual to differ from the original sample only in a small and compact temporal segment. These time-contiguous perturbations allow for a much sparser and, thus, interpretable solution. Furthermore, plausibility/realism of the generated counterfactual explanations is enforced via the proposed adversarial learning strategy.  ( 2 min )
    Machine Learning technique for isotopic determination of radioisotopes using HPGe $\mathrm{\gamma}$-ray spectra. (arXiv:2301.01415v1 [physics.data-an])
    $\mathrm{\gamma}$-ray spectroscopy is a quantitative, non-destructive technique that may be utilized for the identification and quantitative isotopic estimation of radionuclides. Traditional methods of isotopic determination have various challenges that contribute to statistical and systematic uncertainties in the estimated isotopics. Furthermore, these methods typically require numerous pre-processing steps, and have only been rigorously tested in laboratory settings with limited shielding. In this work, we examine the application of a number of machine learning based regression algorithms as alternatives to conventional approaches for analyzing $\mathrm{\gamma}$-ray spectroscopy data in the Emergency Response arena. This approach not only eliminates many steps in the analysis procedure, and therefore offers potential to reduce this source of systematic uncertainty, but is also shown to offer comparable performance to conventional approaches in the Emergency Response Application.  ( 2 min )
    Towards the Identifiability in Noisy Label Learning: A Multinomial Mixture Approach. (arXiv:2301.01405v1 [cs.LG])
    Learning from noisy labels plays an important role in the deep learning era. Despite numerous studies with promising results, identifying clean labels from a noisily-annotated dataset is still challenging since the conventional noisy label learning problem with single noisy label per instance is not identifiable, i.e., it does not theoretically have a unique solution unless one has access to clean labels or introduces additional assumptions. This paper aims to formally investigate such identifiability issue by formulating the noisy label learning problem as a multinomial mixture model, enabling the formulation of the identifiability constraint. In particular, we prove that the noisy label learning problem is identifiable if there are at least $2C - 1$ noisy labels per instance provided, with $C$ being the number of classes. In light of such requirement, we propose a method that automatically generates additional noisy labels per training sample by estimating the noisy label distribution based on nearest neighbours. Such additional noisy labels allow us to apply the Expectation - Maximisation algorithm to estimate the posterior of clean labels. We empirically demonstrate that the proposed method is not only capable of estimating clean labels without any heuristics in several challenging label noise benchmarks, including synthetic, web-controlled and real-world label noises, but also of performing competitively with many state-of-the-art methods.  ( 2 min )
    Kernel Subspace and Feature Extraction. (arXiv:2301.01410v1 [cs.LG])
    We study kernel methods in machine learning from the perspective of feature subspace. We establish a one-to-one correspondence between feature subspaces and kernels and propose an information-theoretic measure for kernels. In particular, we construct a kernel from Hirschfeld--Gebelein--R\'{e}nyi maximal correlation functions, coined the maximal correlation kernel, and demonstrate its information-theoretic optimality. We use the support vector machine (SVM) as an example to illustrate a connection between kernel methods and feature extraction approaches. We show that the kernel SVM on maximal correlation kernel achieves minimum prediction error. Finally, we interpret the Fisher kernel as a special maximal correlation kernel and establish its optimality.  ( 2 min )
    The Predictive Forward-Forward Algorithm. (arXiv:2301.01452v1 [cs.LG])
    In this work, we propose a generalization of the forward-forward (FF) algorithm that we call the predictive forward-forward (PFF) algorithm. Specifically, we design a dynamic, recurrent neural system that learns a directed generative circuit jointly and simultaneously with a representation circuit, combining elements of predictive coding, an emerging and viable neurobiological process theory of cortical function, with the forward-forward adaptation scheme. Furthermore, PFF efficiently learns to propagate learning signals and updates synapses with forward passes only, eliminating some of the key structural and computational constraints imposed by a backprop-based scheme. Besides computational advantages, the PFF process could be further useful for understanding the learning mechanisms behind biological neurons that make use of local (and global) signals despite missing feedback connections. We run several experiments on image data and demonstrate that the PFF procedure works as well as backprop, offering a promising brain-inspired algorithm for classifying, reconstructing, and synthesizing data patterns. As a result, our approach presents further evidence of the promise afforded by backprop-alternative credit assignment algorithms within the context of brain-inspired computing.  ( 2 min )
    Online Learning of Smooth Functions. (arXiv:2301.01434v1 [cs.LG])
    In this paper, we study the online learning of real-valued functions where the hidden function is known to have certain smoothness properties. Specifically, for $q \ge 1$, let $\mathcal F_q$ be the class of absolutely continuous functions $f: [0,1] \to \mathbb R$ such that $\|f'\|_q \le 1$. For $q \ge 1$ and $d \in \mathbb Z^+$, let $\mathcal F_{q,d}$ be the class of functions $f: [0,1]^d \to \mathbb R$ such that any function $g: [0,1] \to \mathbb R$ formed by fixing all but one parameter of $f$ is in $\mathcal F_q$. For any class of real-valued functions $\mathcal F$ and $p>0$, let $\text{opt}_p(\mathcal F)$ be the best upper bound on the sum of $p^{\text{th}}$ powers of absolute prediction errors that a learner can guarantee in the worst case. In the single-variable setup, we find new bounds for $\text{opt}_p(\mathcal F_q)$ that are sharp up to a constant factor. We show for all $\varepsilon \in (0, 1)$ that $\text{opt}_{1+\varepsilon}(\mathcal{F}_{\infty}) = \Theta(\varepsilon^{-\frac{1}{2}})$ and $\text{opt}_{1+\varepsilon}(\mathcal{F}_q) = \Theta(\varepsilon^{-\frac{1}{2}})$ for all $q \ge 2$. We also show for $\varepsilon \in (0,1)$ that $\text{opt}_2(\mathcal F_{1+\varepsilon})=\Theta(\varepsilon^{-1})$. In addition, we obtain new exact results by proving that $\text{opt}_p(\mathcal F_q)=1$ for $q \in (1,2)$ and $p \ge 2+\frac{1}{q-1}$. In the multi-variable setup, we establish inequalities relating $\text{opt}_p(\mathcal F_{q,d})$ to $\text{opt}_p(\mathcal F_q)$ and show that $\text{opt}_p(\mathcal F_{\infty,d})$ is infinite when $pd$. We also obtain sharp bounds on learning $\mathcal F_{\infty,d}$ for $p < d$ when the number of trials is bounded.  ( 2 min )
    DADAgger: Disagreement-Augmented Dataset Aggregation. (arXiv:2301.01348v1 [cs.LG])
    DAgger is an imitation algorithm that aggregates its original datasets by querying the expert on all samples encountered during training. In order to reduce the number of samples queried, we propose a modification to DAgger, known as DADAgger, which only queries the expert for state-action pairs that are out of distribution (OOD). OOD states are identified by measuring the variance of the action predictions of an ensemble of models on each state, which we simulate using dropout. Testing on the Car Racing and Half Cheetah environments achieves comparable performance to DAgger but with reduced expert queries, and better performance than a random sampling baseline. We also show that our algorithm may be used to build efficient, well-balanced training datasets by running with no initial data and only querying the expert to resolve uncertainty.  ( 2 min )
    An ensemble-based framework for mispronunciation detection of Arabic phonemes. (arXiv:2301.01378v1 [cs.SD])
    Determination of mispronunciations and ensuring feedback to users are maintained by computer-assisted language learning (CALL) systems. In this work, we introduce an ensemble model that defines the mispronunciation of Arabic phonemes and assists learning of Arabic, effectively. To the best of our knowledge, this is the very first attempt to determine the mispronunciations of Arabic phonemes employing ensemble learning techniques and conventional machine learning models, comprehensively. In order to observe the effect of feature extraction techniques, mel-frequency cepstrum coefficients (MFCC), and Mel spectrogram are blended with each learning algorithm. To show the success of proposed model, 29 letters in the Arabic phonemes, 8 of which are hafiz, are voiced by a total of 11 different person. The amount of data set has been enhanced employing the methods of adding noise, time shifting, time stretching, pitch shifting. Extensive experiment results demonstrate that the utilization of voting classifier as an ensemble algorithm with Mel spectrogram feature extraction technique exhibits remarkable classification result with 95.9% of accuracy.  ( 2 min )
    Task Weighting in Meta-learning with Trajectory Optimisation. (arXiv:2301.01400v1 [cs.LG])
    Developing meta-learning algorithms that are un-biased toward a subset of training tasks often requires hand-designed criteria to weight tasks, potentially resulting in sub-optimal solutions. In this paper, we introduce a new principled and fully-automated task-weighting algorithm for meta-learning methods. By considering the weights of tasks within the same mini-batch as an action, and the meta-parameter of interest as the system state, we cast the task-weighting meta-learning problem to a trajectory optimisation and employ the iterative linear quadratic regulator to determine the optimal action or weights of tasks. We theoretically show that the proposed algorithm converges to an $\epsilon_{0}$-stationary point, and empirically demonstrate that the proposed approach out-performs common hand-engineering weighting methods in two few-shot learning benchmarks.  ( 2 min )
    Brain Tissue Segmentation Across the Human Lifespan via Supervised Contrastive Learning. (arXiv:2301.01369v1 [eess.IV])
    Automatic segmentation of brain MR images into white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF) is critical for tissue volumetric analysis and cortical surface reconstruction. Due to dramatic structural and appearance changes associated with developmental and aging processes, existing brain tissue segmentation methods are only viable for specific age groups. Consequently, methods developed for one age group may fail for another. In this paper, we make the first attempt to segment brain tissues across the entire human lifespan (0-100 years of age) using a unified deep learning model. To overcome the challenges related to structural variability underpinned by biological processes, intensity inhomogeneity, motion artifacts, scanner-induced differences, and acquisition protocols, we propose to use contrastive learning to improve the quality of feature representations in a latent space for effective lifespan tissue segmentation. We compared our approach with commonly used segmentation methods on a large-scale dataset of 2,464 MR images. Experimental results show that our model accurately segments brain tissues across the lifespan and outperforms existing methods.  ( 2 min )
    Identifying Exoplanets with Deep Learning. V. Improved Light Curve Classification for TESS Full Frame Image Observations. (arXiv:2301.01371v1 [astro-ph.EP])
    The TESS mission produces a large amount of time series data, only a small fraction of which contain detectable exoplanetary transit signals. Deep learning techniques such as neural networks have proved effective at differentiating promising astrophysical eclipsing candidates from other phenomena such as stellar variability and systematic instrumental effects in an efficient, unbiased and sustainable manner. This paper presents a high quality dataset containing light curves from the Primary Mission and 1st Extended Mission full frame images and periodic signals detected via Box Least Squares (Kov\'acs et al. 2002; Hartman 2012). The dataset was curated using a thorough manual review process then used to train a neural network called Astronet-Triage-v2. On our test set, for transiting/eclipsing events we achieve a 99.6% recall (true positives over all data with positive labels) at a precision of 75.7% (true positives over all predicted positives). Since 90% of our training data is from the Primary Mission, we also test our ability to generalize on held-out 1st Extended Mission data. Here, we find an area under the precision-recall curve of 0.965, a 4% improvement over Astronet-Triage (Yu et al. 2019). On the TESS Object of Interest (TOI) Catalog through April 2022, a shortlist of planets and planet candidates, Astronet-Triage-v2 is able to recover 3577 out of 4140 TOIs, while Astronet-Triage only recovers 3349 targets at an equal level of precision. In other words, upgrading to Astronet-Triage-v2 helps save at least 200 planet candidates from being lost. The new model is currently used for planet candidate triage in the Quick-Look Pipeline (Huang et al. 2020a,b; Kunimoto et al. 2021).  ( 3 min )
    Recent Advances on Federated Learning: A Systematic Survey. (arXiv:2301.01299v1 [cs.LG])
    Federated learning has emerged as an effective paradigm to achieve privacy-preserving collaborative learning among different parties. Compared to traditional centralized learning that requires collecting data from each party, in federated learning, only the locally trained models or computed gradients are exchanged, without exposing any data information. As a result, it is able to protect privacy to some extent. In recent years, federated learning has become more and more prevalent and there have been many surveys for summarizing related methods in this hot research topic. However, most of them focus on a specific perspective or lack the latest research progress. In this paper, we provide a systematic survey on federated learning, aiming to review the recent advanced federated methods and applications from different aspects. Specifically, this paper includes four major contributions. First, we present a new taxonomy of federated learning in terms of the pipeline and challenges in federated scenarios. Second, we summarize federated learning methods into several categories and briefly introduce the state-of-the-art methods under these categories. Third, we overview some prevalent federated learning frameworks and introduce their features. Finally, some potential deficiencies of current methods and several future directions are discussed.  ( 2 min )
    oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation. (arXiv:2301.01333v1 [cs.LG])
    With the rapid development of deep learning models and hardware support for dense computing, the deep learning (DL) workload characteristics changed significantly from a few hot spots on compute-intensive operations to a broad range of operations scattered across the models. Accelerating a few compute-intensive operations using the expert-tuned implementation of primitives does not fully exploit the performance potential of AI hardware. Various efforts are made to compile a full deep neural network (DNN) graph. One of the biggest challenges is to achieve end-to-end compilation by generating expert-level performance code for the dense compute-intensive operations and applying compilation optimization at the scope of DNN computation graph across multiple compute-intensive operations. We present oneDNN Graph Compiler, a tensor compiler that employs a hybrid approach of using techniques from both compiler optimization and expert-tuned kernels for high-performance code generation of the deep neural network graph. oneDNN Graph Compiler addresses unique optimization challenges in the deep learning domain, such as low-precision computation, aggressive fusion, optimization for static tensor shapes and memory layout, constant weight optimization, and memory buffer reuse. Experimental results demonstrate up to 2x performance gains over primitives-based optimization for performance-critical DNN computation graph patterns on Intel Xeon Scalable Processors.  ( 2 min )
    Neural SDEs for Conditional Time Series Generation and the Signature-Wasserstein-1 metric. (arXiv:2301.01315v1 [stat.ML])
    (Conditional) Generative Adversarial Networks (GANs) have found great success in recent years, due to their ability to approximate (conditional) distributions over extremely high dimensional spaces. However, they are highly unstable and computationally expensive to train, especially in the time series setting. Recently, it has been proposed the use of a key object in rough path theory, called the signature of a path, which is able to convert the min-max formulation given by the (conditional) GAN framework into a classical minimization problem. However, this method is extremely expensive in terms of memory cost, sometimes even becoming prohibitive. To overcome this, we propose the use of \textit{Conditional Neural Stochastic Differential Equations}, which have a constant memory cost as a function of depth, being more memory efficient than traditional deep learning architectures. We empirically test that this proposed model is more efficient than other classical approaches, both in terms of memory cost and computational time, and that it usually outperforms them in terms of performance.  ( 2 min )
    Towards Deployable RL -- What's Broken with RL Research and a Potential Fix. (arXiv:2301.01320v1 [cs.LG])
    Reinforcement learning (RL) has demonstrated great potential, but is currently full of overhyping and pipe dreams. We point to some difficulties with current research which we feel are endemic to the direction taken by the community. To us, the current direction is not likely to lead to "deployable" RL: RL that works in practice and can work in practical situations yet still is economically viable. We also propose a potential fix to some of the difficulties of the field.  ( 2 min )
    Contextual Conservative Q-Learning for Offline Reinforcement Learning. (arXiv:2301.01298v1 [cs.LG])
    Offline reinforcement learning learns an effective policy on offline datasets without online interaction, and it attracts persistent research attention due to its potential of practical application. However, extrapolation error generated by distribution shift will still lead to the overestimation for those actions that transit to out-of-distribution(OOD) states, which degrades the reliability and robustness of the offline policy. In this paper, we propose Contextual Conservative Q-Learning(C-CQL) to learn a robustly reliable policy through the contextual information captured via an inverse dynamics model. With the supervision of the inverse dynamics model, it tends to learn a policy that generates stable transition at perturbed states, for the fact that pertuebed states are a common kind of OOD states. In this manner, we enable the learnt policy more likely to generate transition that destines to the empirical next state distributions of the offline dataset, i.e., robustly reliable transition. Besides, we theoretically reveal that C-CQL is the generalization of the Conservative Q-Learning(CQL) and aggressive State Deviation Correction(SDC). Finally, experimental results demonstrate the proposed C-CQL achieves the state-of-the-art performance in most environments of offline Mujoco suite and a noisy Mujoco setting.  ( 2 min )
    Decentralized Gradient Tracking with Local Steps. (arXiv:2301.01313v1 [math.OC])
    Gradient tracking (GT) is an algorithm designed for solving decentralized optimization problems over a network (such as training a machine learning model). A key feature of GT is a tracking mechanism that allows to overcome data heterogeneity between nodes. We develop a novel decentralized tracking mechanism, $K$-GT, that enables communication-efficient local updates in GT while inheriting the data-independence property of GT. We prove a convergence rate for $K$-GT on smooth non-convex functions and prove that it reduces the communication overhead asymptotically by a linear factor $K$, where $K$ denotes the number of local steps. We illustrate the robustness and effectiveness of this heterogeneity correction on convex and non-convex benchmark problems and on a non-convex neural network training task with the MNIST dataset.  ( 2 min )
    Operator theory, kernels, and Feedforward Neural Networks. (arXiv:2301.01327v1 [cs.LG])
    In this paper we show how specific families of positive definite kernels serve as powerful tools in analyses of iteration algorithms for multiple layer feedforward Neural Network models. Our focus is on particular kernels that adapt well to learning algorithms for data-sets/features which display intrinsic self-similarities at feedforward iterations of scaling.  ( 2 min )
  • Open

    GEC: A Unified Framework for Interactive Decision Making in MDP, POMDP, and Beyond. (arXiv:2211.01962v3 [cs.LG] UPDATED)
    We study sample efficient reinforcement learning (RL) under the general framework of interactive decision making, which includes Markov decision process (MDP), partially observable Markov decision process (POMDP), and predictive state representation (PSR) as special cases. Toward finding the minimum assumption that empowers sample efficient learning, we propose a novel complexity measure, generalized eluder coefficient (GEC), which characterizes the fundamental tradeoff between exploration and exploitation in online interactive decision making. In specific, GEC captures the hardness of exploration by comparing the error of predicting the performance of the updated policy with the in-sample training error evaluated on the historical data. We show that RL problems with low GEC form a remarkably rich class, which subsumes low Bellman eluder dimension problems, bilinear class, low witness rank problems, PO-bilinear class, and generalized regular PSR, where generalized regular PSR, a new tractable PSR class identified by us, includes nearly all known tractable POMDPs and PSRs. Furthermore, in terms of algorithm design, we propose a generic posterior sampling algorithm, which can be implemented in both model-free and model-based fashion, under both fully observable and partially observable settings. The proposed algorithm modifies the standard posterior sampling algorithm in two aspects: (i) we use an optimistic prior distribution that biases towards hypotheses with higher values and (ii) a loglikelihood function is set to be the empirical loss evaluated on the historical data, where the choice of loss function supports both model-free and model-based learning. We prove that the proposed algorithm is sample efficient by establishing a sublinear regret upper bound in terms of GEC. In summary, we provide a new and unified understanding of both fully observable and partially observable RL.  ( 3 min )
    Best Arm Identification with Contextual Information under a Small Gap. (arXiv:2209.07330v4 [cs.LG] UPDATED)
    We study the best-arm identification (BAI) problem with a fixed budget and contextual (covariate) information. In each round of an adaptive experiment, after observing contextual information, we choose a treatment arm using past observations and current context. Our goal is to identify the best treatment arm, which is a treatment arm with the maximal expected reward marginalized over the contextual distribution, with a minimal probability of misidentification. In this study, we consider a class of nonparametric bandit models that converge to location-shift models when the gaps go to zero. First, we derive lower bounds of the misidentification probability for a certain class of strategies and bandit models (probabilistic models of potential outcomes) under a small-gap regime. A small-gap regime is a situation where gaps of the expected rewards between the best and suboptimal treatment arms go to zero, which corresponds to one of the worst cases in identifying the best treatment arm. We then develop the ``Random Sampling (RS)-Augmented Inverse Probability weighting (AIPW) strategy,'' which is asymptotically optimal in the sense that the probability of misidentification under the strategy matches the lower bound when the budget goes to infinity in the small-gap regime. The RS-AIPW strategy consists of the RS rule tracking a target sample allocation ratio and the recommendation rule using the AIPW estimator.  ( 2 min )
    Constrained regret minimization for multi-criterion multi-armed bandits. (arXiv:2006.09649v2 [cs.LG] UPDATED)
    We consider a stochastic multi-armed bandit setting and study the problem of constrained regret minimization over a given time horizon. Each arm is associated with an unknown, possibly multi-dimensional distribution, and the merit of an arm is determined by several, possibly conflicting attributes. The aim is to optimize a 'primary' attribute subject to user-provided constraints on other 'secondary' attributes. We assume that the attributes can be estimated using samples from the arms' distributions, and that the estimators enjoy suitable concentration properties. We propose an algorithm called Con-LCB that guarantees a logarithmic regret, i.e., the average number of plays of all non-optimal arms is at most logarithmic in the horizon. The algorithm also outputs a Boolean flag that correctly identifies, with high probability, whether the given instance is feasible/infeasible with respect to the constraints. We also show that Con-LCB is optimal within a universal constant, i.e., that more sophisticated algorithms cannot do much better universally. Finally, we establish a fundamental trade-off between regret minimization and feasibility identification. Our framework finds natural applications, for instance, in financial portfolio optimization, where risk constrained maximization of expected return is meaningful.  ( 2 min )
    The good, the bad and the ugly sides of data augmentation: An implicit spectral regularization perspective. (arXiv:2210.05021v2 [cs.LG] UPDATED)
    Data augmentation (DA) is a powerful workhorse for bolstering performance in modern machine learning. Specific augmentations like translations and scaling in computer vision are traditionally believed to improve generalization by generating new (artificial) data from the same distribution. However, this traditional viewpoint does not explain the success of prevalent augmentations in modern machine learning (e.g. randomized masking, cutout, mixup), that greatly alter the training data distribution. In this work, we develop a new theoretical framework to characterize the impact of a general class of DA on underparameterized and overparameterized linear model generalization. Our framework reveals that DA induces implicit spectral regularization through a combination of two distinct effects: a) manipulating the relative proportion of eigenvalues of the data covariance matrix in a training-data-dependent manner, and b) uniformly boosting the entire spectrum of the data covariance matrix through ridge regression. These effects, when applied to popular augmentations, give rise to a wide variety of phenomena, including discrepancies in generalization between over-parameterized and under-parameterized regimes and differences between regression and classification tasks. Our framework highlights the nuanced and sometimes surprising impacts of DA on generalization, and serves as a testbed for novel augmentation design.  ( 2 min )
    Time-uniform central limit theory, asymptotic confidence sequences, and anytime-valid causal inference. (arXiv:2103.06476v6 [math.ST] UPDATED)
    Confidence intervals based on the central limit theorem (CLT) are a cornerstone of classical statistics. Despite being only asymptotically valid, they are ubiquitous because they permit statistical inference under very weak assumptions, and can often be applied to problems even when nonasymptotic inference is impossible. This paper introduces time-uniform analogues of such asymptotic confidence intervals. To elaborate, our methods take the form of confidence sequences (CS) -- sequences of confidence intervals that are uniformly valid over time. CSs provide valid inference at arbitrary stopping times, incurring no penalties for "peeking" at the data, unlike classical confidence intervals which require the sample size to be fixed in advance. Existing CSs in the literature are nonasymptotic, and hence do not enjoy the aforementioned broad applicability of asymptotic confidence intervals. Our work bridges the gap by giving a definition for "asymptotic CSs", and deriving a universal asymptotic CS that requires only weak CLT-like assumptions. While the CLT approximates the distribution of a sample average by that of a Gaussian at a fixed sample size, we use strong invariance principles (stemming from the seminal 1970s work of Komlos, Major, and Tusnady) to uniformly approximate the entire sample average process by an implicit Gaussian process. We demonstrate their utility by deriving nonparametric asymptotic CSs for the average treatment effect based on doubly robust estimators in observational studies, for which no nonasymptotic methods can exist even in the fixed-time regime (due to confounding bias). These enable doubly robust causal inference that can be continuously monitored and adaptively stopped.  ( 3 min )
    Approximate blocked Gibbs sampling for Bayesian neural networks. (arXiv:2208.11389v2 [stat.ML] UPDATED)
    In this work, minibatch MCMC sampling for feedforward neural networks is made more feasible. To this end, it is proposed to sample subgroups of parameters via a blocked Gibbs sampling scheme. By partitioning the parameter space, sampling is possible irrespective of layer width. It is also possible to alleviate vanishing acceptance rates for increasing depth by reducing the proposal variance in deeper layers. Increasing the length of a non-convergent chain increases the predictive accuracy in classification tasks, so avoiding vanishing acceptance rates and consequently enabling longer chain runs have practical benefits. Moreover, non-convergent chain realizations aid in the quantification of predictive uncertainty. An open problem is how to perform minibatch MCMC sampling for feedforward neural networks in the presence of augmented data.  ( 2 min )
    Learning Gaussian Mixtures Using the Wasserstein-Fisher-Rao Gradient Flow. (arXiv:2301.01766v1 [math.ST])
    Gaussian mixture models form a flexible and expressive parametric family of distributions that has found applications in a wide variety of applications. Unfortunately, fitting these models to data is a notoriously hard problem from a computational perspective. Currently, only moment-based methods enjoy theoretical guarantees while likelihood-based methods are dominated by heuristics such as Expectation-Maximization that are known to fail in simple examples. In this work, we propose a new algorithm to compute the nonparametric maximum likelihood estimator (NPMLE) in a Gaussian mixture model. Our method is based on gradient descent over the space of probability measures equipped with the Wasserstein-Fisher-Rao geometry for which we establish convergence guarantees. In practice, it can be approximated using an interacting particle system where the weight and location of particles are updated alternately. We conduct extensive numerical experiments to confirm the effectiveness of the proposed algorithm compared not only to classical benchmarks but also to similar gradient descent algorithms with respect to simpler geometries. In particular, these simulations illustrate the benefit of updating both weight and location of the interacting particles.  ( 2 min )
    CI-GNN: A Granger Causality-Inspired Graph Neural Network for Interpretable Brain Network-Based Psychiatric Diagnosis. (arXiv:2301.01642v1 [stat.ML])
    There is a recent trend to leverage the power of graph neural networks (GNNs) for brain-network based psychiatric diagnosis, which,in turn, also motivates an urgent need for psychiatrists to fully understand the decision behavior of the used GNNs. However, most of the existing GNN explainers are either post-hoc in which another interpretive model needs to be created to explain a well-trained GNN, or do not consider the causal relationship between the extracted explanation and the decision, such that the explanation itself contains spurious correlations and suffers from weak faithfulness. In this work, we propose a granger causality-inspired graph neural network (CI-GNN), a built-in interpretable model that is able to identify the most influential subgraph (i.e., functional connectivity within brain regions) that is causally related to the decision (e.g., major depressive disorder patients or healthy controls), without the training of an auxillary interpretive network. CI-GNN learns disentangled subgraph-level representations {\alpha} and \b{eta} that encode, respectively, the causal and noncausal aspects of original graph under a graph variational autoencoder framework, regularized by a conditional mutual information (CMI) constraint. We theoretically justify the validity of the CMI regulation in capturing the causal relationship. We also empirically evaluate the performance of CI-GNN against three baseline GNNs and four state-of-the-art GNN explainers on synthetic data and two large-scale brain disease datasets. We observe that CI-GNN achieves the best performance in a wide range of metrics and provides more reliable and concise explanations which have clinical evidence.  ( 2 min )
    Scalable Optimal Design of Incremental Volt/VAR Control using Deep Neural Networks. (arXiv:2301.01440v1 [math.OC])
    Volt/VAR control rules facilitate the autonomous operation of distributed energy resources (DER) to regulate voltage in power distribution grids. According to non-incremental control rules, such as the one mandated by the IEEE Standard 1547, the reactive power setpoint of each DER is computed as a piecewise-linear curve of the local voltage. However, the slopes of such curves are upper-bounded to ensure stability. On the other hand, incremental rules add a memory term into the setpoint update, rendering them universally stable. They can thus attain enhanced steady-state voltage profiles. Optimal rule design (ORD) for incremental rules can be formulated as a bilevel program. We put forth a scalable solution by reformulating ORD as training a deep neural network (DNN). This DNN emulates the Volt/VAR dynamics for incremental rules derived as iterations of proximal gradient descent (PGD). Analytical findings and numerical tests corroborate that the proposed ORD solution can be neatly adapted to single/multi-phase feeders.  ( 2 min )
    Lessons Learned Applying Deep Learning Approaches to Forecasting Complex Seasonal Behavior. (arXiv:2301.01476v1 [stat.AP])
    Deep learning methods have gained popularity in recent years through the media and the relative ease of implementation through open source packages such as Keras. We investigate the applicability of popular recurrent neural networks in forecasting call center volumes at a large financial services company. These series are highly complex with seasonal patterns - between hours of the day, day of the week, and time of the year - in addition to autocorrelation between individual observations. Though we investigate the financial services industry, the recommendations for modeling cyclical nonlinear behavior generalize across all sectors. We explore the optimization of parameter settings and convergence criteria for Elman (simple), Long Short-Term Memory (LTSM), and Gated Recurrent Unit (GRU) RNNs from a practical point of view. A designed experiment using actual call center data across many different "skills" (income call streams) compares performance measured by validation error rates of the best observed RNN configurations against other modern and classical forecasting techniques. We summarize the utility of and considerations required for using deep learning methods in forecasting.  ( 2 min )
    Large-width asymptotics for ReLU neural networks with $\alpha$-Stable initializations. (arXiv:2206.08065v3 [cs.LG] UPDATED)
    There is a recent and growing literature on large-width asymptotic properties of Gaussian neural networks (NNs), namely NNs whose weights are initialized as Gaussian distributions. Two popular problems are: i) the study of the large-width distributions of NNs, which characterizes the infinitely wide limit of a rescaled NN in terms of a Gaussian stochastic process; ii) the study of the large-width training dynamics of NNs, which characterizes the infinitely wide dynamics in terms of a deterministic kernel, referred to as the neural tangent kernel (NTK), and shows that, for a sufficiently large width, the gradient descent achieves zero training error at a linear rate. In this paper, we consider these problems for $\alpha$-Stable NNs, namely NNs whose weights are initialized as $\alpha$-Stable distributions with $\alpha\in(0,2]$. First, for $\alpha$-Stable NNs with a ReLU activation function, we show that if the NN's width goes to infinity then a rescaled NN converges weakly to an $\alpha$-Stable stochastic process. As a difference with respect to the Gaussian setting, our result shows that the choice of the activation function affects the scaling of the NN, that is: to achieve the infinitely wide $\alpha$-Stable process, the ReLU activation requires an additional logarithmic term in the scaling with respect to sub-linear activations. Then, we study the large-width training dynamics of $\alpha$-Stable ReLU-NNs, characterizing the infinitely wide dynamics in terms of a random kernel, referred to as the $\alpha$-Stable NTK, and showing that, for a sufficiently large width, the gradient descent achieves zero training error at a linear rate. The randomness of the $\alpha$-Stable NTK is a further difference with respect to the Gaussian setting, that is: within the $\alpha$-Stable setting, the randomness of the NN at initialization does not vanish in the large-width regime of the training.  ( 3 min )
    Counterfactual Explanations for Land Cover Mapping in a Multi-class Setting. (arXiv:2301.01520v1 [cs.LG])
    Counterfactual explanations are an emerging tool to enhance interpretability of deep learning models. Given a sample, these methods seek to find and display to the user similar samples across the decision boundary. In this paper, we propose a generative adversarial counterfactual approach for satellite image time series in a multi-class setting for the land cover classification task. One of the distinctive features of the proposed approach is the lack of prior assumption on the targeted class for a given counterfactual explanation. This inherent flexibility allows for the discovery of interesting information on the relationship between land cover classes. The other feature consists of encouraging the counterfactual to differ from the original sample only in a small and compact temporal segment. These time-contiguous perturbations allow for a much sparser and, thus, interpretable solution. Furthermore, plausibility/realism of the generated counterfactual explanations is enforced via the proposed adversarial learning strategy.  ( 2 min )
    A General Framework for Learning Mean-Field Games. (arXiv:2003.06069v3 [cs.LG] UPDATED)
    This paper presents a general mean-field game (GMFG) framework for simultaneous learning and decision-making in stochastic games with a large population. It first establishes the existence of a unique Nash Equilibrium to this GMFG, and demonstrates that naively combining reinforcement learning with the fixed-point approach in classical MFGs yields unstable algorithms. It then proposes value-based and policy-based reinforcement learning algorithms (GMF-V and GMF-P, respectively) with smoothed policies, with analysis of their convergence properties and computational complexities. Experiments on an equilibrium product pricing problem demonstrate that GMF-V-Q and GMF-P-TRPO, two specific instantiations of GMF-V and GMF-P, respectively, with Q-learning and TRPO, are both efficient and robust in the GMFG setting. Moreover, their performance is superior in convergence speed, accuracy, and stability when compared with existing algorithms for multi-agent reinforcement learning in the $N$-player setting.  ( 2 min )
    Federated Learning for Data Streams. (arXiv:2301.01542v1 [cs.LG])
    Federated learning (FL) is an effective solution to train machine learning models on the increasing amount of data generated by IoT devices and smartphones while keeping such data localized. Most previous work on federated learning assumes that clients operate on static datasets collected before training starts. This approach may be inefficient because 1) it ignores new samples clients collect during training, and 2) it may require a potentially long preparatory phase for clients to collect enough data. Moreover, learning on static datasets may be simply impossible in scenarios with small aggregate storage across devices. It is, therefore, necessary to design federated algorithms able to learn from data streams. In this work, we formulate and study the problem of \emph{federated learning for data streams}. We propose a general FL algorithm to learn from data streams through an opportune weighted empirical risk minimization. Our theoretical analysis provides insights to configure such an algorithm, and we evaluate its performance on a wide range of machine learning tasks.  ( 2 min )
    Matrices with Gaussian noise: optimal estimates for singular subspace perturbation. (arXiv:1803.00679v2 [stat.ML] UPDATED)
    The Davis--Kahan--Wedin $\sin \Theta$ theorem describes how the singular subspaces of a matrix change when subjected to a small perturbation. This classic result is sharp in the worst case scenario. In this paper, we prove a stochastic version of the Davis--Kahan--Wedin $\sin \Theta$ theorem when the perturbation is a Gaussian random matrix. Under certain structural assumptions, we obtain an optimal bound that significantly improves upon the classic Davis--Kahan--Wedin $\sin \Theta$ theorem. One of our key tools is a new perturbation bound for the singular values, which may be of independent interest.  ( 2 min )
    First-order penalty methods for bilevel optimization. (arXiv:2301.01716v1 [math.OC])
    In this paper we study a class of unconstrained and constrained bilevel optimization problems in which the lower-level part is a convex optimization problem, while the upper-level part is possibly a nonconvex optimization problem. In particular, we propose penalty methods for solving them, whose subproblems turn out to be a structured minimax problem and are suitably solved by a first-order method developed in this paper. Under some suitable assumptions, an \emph{operation complexity} of ${\cal O}(\varepsilon^{-4}\log\varepsilon^{-1})$ and ${\cal O}(\varepsilon^{-7}\log\varepsilon^{-1})$, measured by their fundamental operations, is established for the proposed penalty methods for finding an $\varepsilon$-KKT solution of the unconstrained and constrained bilevel optimization problems, respectively. To the best of our knowledge, the methodology and results in this paper are new.  ( 2 min )
    Implicit Bias of Gradient Descent for Mean Squared Error Regression with Two-Layer Wide Neural Networks. (arXiv:2006.07356v3 [stat.ML] UPDATED)
    We investigate gradient descent training of wide neural networks and the corresponding implicit bias in function space. For univariate regression, we show that the solution of training a width-$n$ shallow ReLU network is within $n^{- 1/2}$ of the function which fits the training data and whose difference from the initial function has the smallest 2-norm of the second derivative weighted by a curvature penalty that depends on the probability distribution that is used to initialize the network parameters. We compute the curvature penalty function explicitly for various common initialization procedures. For instance, asymmetric initialization with a uniform distribution yields a constant curvature penalty, and thence the solution function is the natural cubic spline interpolation of the training data. \hj{For stochastic gradient descent we obtain the same implicit bias result.} We obtain a similar result for different activation functions. For multivariate regression we show an analogous result, whereby the second derivative is replaced by the Radon transform of a fractional Laplacian. For initialization schemes that yield a constant penalty function, the solutions are polyharmonic splines. Moreover, we show that the training trajectories are captured by trajectories of smoothing splines with decreasing regularization strength.  ( 2 min )
    Kernel Subspace and Feature Extraction. (arXiv:2301.01410v1 [cs.LG])
    We study kernel methods in machine learning from the perspective of feature subspace. We establish a one-to-one correspondence between feature subspaces and kernels and propose an information-theoretic measure for kernels. In particular, we construct a kernel from Hirschfeld--Gebelein--R\'{e}nyi maximal correlation functions, coined the maximal correlation kernel, and demonstrate its information-theoretic optimality. We use the support vector machine (SVM) as an example to illustrate a connection between kernel methods and feature extraction approaches. We show that the kernel SVM on maximal correlation kernel achieves minimum prediction error. Finally, we interpret the Fisher kernel as a special maximal correlation kernel and establish its optimality.  ( 2 min )
    Geometric Ergodicity in Modified Variations of Riemannian Manifold and Lagrangian Monte Carlo. (arXiv:2301.01409v1 [stat.ME])
    Riemannian manifold Hamiltonian (RMHMC) and Lagrangian Monte Carlo (LMC) have emerged as powerful methods of Bayesian inference. Unlike Euclidean Hamiltonian Monte Carlo (EHMC) and the Metropolis-adjusted Langevin algorithm (MALA), the geometric ergodicity of these Riemannian algorithms has not been extensively studied. On the other hand, the manifold Metropolis-adjusted Langevin algorithm (MMALA) has recently been shown to exhibit geometric ergodicity under certain conditions. This work investigates the mixture of the LMC and RMHMC transition kernels with MMALA in order to equip the resulting method with an "inherited" geometric ergodicity theory. We motivate this mixture kernel based on an analogy between single-step HMC and MALA. We then proceed to evaluate the original and modified transition kernels on several benchmark Bayesian inference tasks.  ( 2 min )
    Online Learning of Smooth Functions. (arXiv:2301.01434v1 [cs.LG])
    In this paper, we study the online learning of real-valued functions where the hidden function is known to have certain smoothness properties. Specifically, for $q \ge 1$, let $\mathcal F_q$ be the class of absolutely continuous functions $f: [0,1] \to \mathbb R$ such that $\|f'\|_q \le 1$. For $q \ge 1$ and $d \in \mathbb Z^+$, let $\mathcal F_{q,d}$ be the class of functions $f: [0,1]^d \to \mathbb R$ such that any function $g: [0,1] \to \mathbb R$ formed by fixing all but one parameter of $f$ is in $\mathcal F_q$. For any class of real-valued functions $\mathcal F$ and $p>0$, let $\text{opt}_p(\mathcal F)$ be the best upper bound on the sum of $p^{\text{th}}$ powers of absolute prediction errors that a learner can guarantee in the worst case. In the single-variable setup, we find new bounds for $\text{opt}_p(\mathcal F_q)$ that are sharp up to a constant factor. We show for all $\varepsilon \in (0, 1)$ that $\text{opt}_{1+\varepsilon}(\mathcal{F}_{\infty}) = \Theta(\varepsilon^{-\frac{1}{2}})$ and $\text{opt}_{1+\varepsilon}(\mathcal{F}_q) = \Theta(\varepsilon^{-\frac{1}{2}})$ for all $q \ge 2$. We also show for $\varepsilon \in (0,1)$ that $\text{opt}_2(\mathcal F_{1+\varepsilon})=\Theta(\varepsilon^{-1})$. In addition, we obtain new exact results by proving that $\text{opt}_p(\mathcal F_q)=1$ for $q \in (1,2)$ and $p \ge 2+\frac{1}{q-1}$. In the multi-variable setup, we establish inequalities relating $\text{opt}_p(\mathcal F_{q,d})$ to $\text{opt}_p(\mathcal F_q)$ and show that $\text{opt}_p(\mathcal F_{\infty,d})$ is infinite when $pd$. We also obtain sharp bounds on learning $\mathcal F_{\infty,d}$ for $p < d$ when the number of trials is bounded.  ( 2 min )
    Testing High-dimensional Multinomials with Applications to Text Analysis. (arXiv:2301.01381v1 [stat.ME])
    Motivated by applications in text mining and discrete distribution inference, we investigate the testing for equality of probability mass functions of $K$ groups of high-dimensional multinomial distributions. A test statistic, which is shown to have an asymptotic standard normal distribution under the null, is proposed. The optimal detection boundary is established, and the proposed test is shown to achieve this optimal detection boundary across the entire parameter space of interest. The proposed method is demonstrated in simulation studies and applied to analyze two real-world datasets to examine variation among consumer reviews of Amazon movies and diversity of statistical paper abstracts.  ( 2 min )
    Covariate-guided Bayesian mixture model for multivariate time series. (arXiv:2301.01373v1 [stat.ME])
    With rapid development of techniques to measure brain activity and structure, statistical methods for analyzing modern brain-imaging play an important role in the advancement of science. Imaging data that measure brain function are usually multivariate time series and are heterogeneous across both imaging sources and subjects, which lead to various statistical and computational challenges. In this paper, we propose a group-based method to cluster a collection of multivariate time series via a Bayesian mixture of smoothing splines. Our method assumes each multivariate time series is a mixture of multiple components with different mixing weights. Time-independent covariates are assumed to be associated with the mixture components and are incorporated via logistic weights of a mixture-of-experts model. We formulate this approach under a fully Bayesian framework using Gibbs sampling where the number of components is selected based on a deviance information criterion. The proposed method is compared to existing methods via simulation studies and is applied to a study on functional near-infrared spectroscopy (fNIRS), which aims to understand infant emotional reactivity and recovery from stress. The results reveal distinct patterns of brain activity, as well as associations between these patterns and selected covariates.  ( 2 min )
    Measuring tail risk at high-frequency: An $L_1$-regularized extreme value regression approach with unit-root predictors. (arXiv:2301.01362v1 [econ.EM])
    We study tail risk dynamics in high-frequency financial markets and their connection with trading activity and market uncertainty. We introduce a dynamic extreme value regression model accommodating both stationary and local unit-root predictors to appropriately capture the time-varying behaviour of the distribution of high-frequency extreme losses. To characterize trading activity and market uncertainty, we consider several volatility and liquidity predictors, and propose a two-step adaptive $L_1$-regularized maximum likelihood estimator to select the most appropriate ones. We establish the oracle property of the proposed estimator for selecting both stationary and local unit-root predictors, and show its good finite sample properties in an extensive simulation study. Studying the high-frequency extreme losses of nine large liquid U.S. stocks using 42 liquidity and volatility predictors, we find the severity of extreme losses to be well predicted by low levels of price impact in period of high volatility of liquidity and volatility.  ( 2 min )
    Towards Deployable RL -- What's Broken with RL Research and a Potential Fix. (arXiv:2301.01320v1 [cs.LG])
    Reinforcement learning (RL) has demonstrated great potential, but is currently full of overhyping and pipe dreams. We point to some difficulties with current research which we feel are endemic to the direction taken by the community. To us, the current direction is not likely to lead to "deployable" RL: RL that works in practice and can work in practical situations yet still is economically viable. We also propose a potential fix to some of the difficulties of the field.  ( 2 min )
    Neural SDEs for Conditional Time Series Generation and the Signature-Wasserstein-1 metric. (arXiv:2301.01315v1 [stat.ML])
    (Conditional) Generative Adversarial Networks (GANs) have found great success in recent years, due to their ability to approximate (conditional) distributions over extremely high dimensional spaces. However, they are highly unstable and computationally expensive to train, especially in the time series setting. Recently, it has been proposed the use of a key object in rough path theory, called the signature of a path, which is able to convert the min-max formulation given by the (conditional) GAN framework into a classical minimization problem. However, this method is extremely expensive in terms of memory cost, sometimes even becoming prohibitive. To overcome this, we propose the use of \textit{Conditional Neural Stochastic Differential Equations}, which have a constant memory cost as a function of depth, being more memory efficient than traditional deep learning architectures. We empirically test that this proposed model is more efficient than other classical approaches, both in terms of memory cost and computational time, and that it usually outperforms them in terms of performance.  ( 2 min )

  • Open

    Will A.I generated shit like art and text be undetectable some day? Will there always be some other A.I that can detect that shit?
    submitted by /u/WaggleMcDaggle [link] [comments]  ( 50 min )
    ai to write quotes for me
    Hi everyone So i write a lot of quotes to do with various building jobs on ms word and I'm wondering if there's an ai that i can get to read all the quotes I've ever written and then get to write quotes for me based on a few promts. eg if i say replace a bathroom with a shower and bath in it and it will write the quote or the brunt of it for me. submitted by /u/pummers88 [link] [comments]  ( 52 min )
    Have AI generate Git commit messages for you
    submitted by /u/abisknees [link] [comments]  ( 51 min )
    ChatGPT and the unbundling of online search
    submitted by /u/bendee983 [link] [comments]  ( 60 min )
    WSJ News Exclusive | ChatGPT Creator OpenAI Is in Talks for Tender Offer That Would Value It at $29 Billion
    submitted by /u/Plopfish [link] [comments]  ( 50 min )
    The AI newsletter that covers the coolest things happening in AI in simple-to-understand bites. Each issue covers: 3 cool use cases or tools, 2 big stories, and 1 funny meme or tweet about AI.
    submitted by /u/Folly237 [link] [comments]  ( 80 min )
    Text-to-law: Are large language models good lobbyists?
    submitted by /u/Peaking_AI [link] [comments]  ( 50 min )
    does anyone know any good ai to detect chords or notes when loading a song??
    submitted by /u/Pollo3652 [link] [comments]  ( 51 min )
    Baidu Research Releases Top 10 Tech Trends for 2023
    ​ https://preview.redd.it/oca9tqrfq9aa1.png?width=1920&format=png&auto=webp&s=8194a46975518ee82e5ecbd4654cbb2b69739545 Baidu Research today released its predictions for the top 10 technology trends of 2023, including big models, digital-real convergence, virtual-real symbiosis, autonomous driving, robotics, scientific computing, quantum computing, privacy computing, ethics in technology, and sustainability. For more: http://research.baidu.com/Blog/index-view?id=178 submitted by /u/trcytony [link] [comments]  ( 50 min )
    I solo developed an avatar app. It's faster and cheaper. Looking for feedback!
    submitted by /u/okaris [link] [comments]  ( 50 min )
    Why are coders/programmers persuaded that they won't be replaced by Al?
    They are like other jobs. If Al can master emotional/artistic jobs, medicine, law etc why not something like coding? Personally I believe that 95% of each field will loose their jobs and the 5% that will remain will have more a supervisory role. P submitted by /u/Ok_Dragonfly_2167 [link] [comments]  ( 63 min )
    Adoption of AI in Medical Imaging Diagnosis
    The adoption of artificial intelligence (AI) in medical imaging has the potential to transform the way we diagnose and treat various health conditions. From X-rays to MRI scans, these techniques allow doctors to see inside the human body and identify problems that might not be visible to the naked eye. However, the adoption of AI in the medical field is not without challenges. A recent study has explored the determinants for the adoption of AI-powered decision-making support systems in the medical imaging workflow. The study used data from clinicians who participated in an international evaluation of healthcare practitioners and applied confirmatory factor analysis and structural equation modeling to understand the factors that influence the adoption and usage of AI in medical imaging. The results of the study showed that understanding the role of security, risk, and trust is crucial for the intended usage of AI in the medical imaging workflow. These findings provide valuable insights for researchers and practitioners looking to understand the adoption and acceptance of AI in the medical field. As AI continues to make advances in the field of medical imaging, it is important to consider the role of these factors in the adoption and acceptance of AI by healthcare practitioners. By understanding and addressing these challenges, we can ensure that the adoption of AI in medical imaging is seamless and effective. In your opinion, what are the biggest challenges to the widespread adoption of AI in the medical imaging workflow? submitted by /u/FMCalisto [link] [comments]  ( 53 min )
    Did anyone have this with Meitu AI ? It only puts blood on my photos
    submitted by /u/LastTranslator8157 [link] [comments]  ( 50 min )
    Death of the narrator? Apple unveils suite of AI-voiced audiobooks
    Apple quietly launched a catalogue of books narrated by AI in a move that may mark the beginning of the end for human narrators. The strategy marks an attempt to upend the lucrative and fast-growing audiobook market - but it also promises to intensify scrutiny over allegations of Apple's anti-competitive behaviour. Apple's development of AI to narrate books could represent a significant shift in how major technology companies see the future of audiobooks. In recent months, Apple approached independent publishers as potential partners, including some in the Canadian market, but not all agreed to participate Apple would shoulder the costs of production and writers would receive royalties from sales. The Future While there is potential for backlash by professional voice actors, authors themselves are increasingly being asked to narrate their own books There is a financial incentive for the writers, both in the upfront payments and the expanded availability of their work But producing an audiobook with a human voice can take weeks and can cost publishers thousands of dollars Apple and Amazon For years, Apple has sold books and audiobooks through its Books app Apple was rumored to be interested in developing its own audiobook service and shifting from a reseller to a producer The move represents a direct shot at rival Amazon, with Apple listing what it said were the benefits of its own system compared to Kindle's Direct Publishing Lawmakers in Europe and the United States have put in place increasing scrutiny of the company in the wake of allegations that Apple limits competition ​ This is from the AI With Vibes Newsletter, read the full issue here: https://aiwithvibes.beehiiv.com/p/new-world-goodbye-homework-elon-loves-ai-essays submitted by /u/Mk_Makanaki [link] [comments]  ( 56 min )
    Meet GPTZero: The AI-Powered Anti-Plagiarism Program
    submitted by /u/liquidocelotYT [link] [comments]  ( 51 min )
    Considering going into the ai field as a career, what is everyone's thoughts on the moral implications of a career?
    submitted by /u/tartigrade78 [link] [comments]  ( 58 min )
    How are those videos made ?
    Hi, I was simply wondering how that kind of videos are made. I guess there is a video "reference" layer (camera movement and some simple shapes ?), then an AI pass that applies some prompts on it ? https://www.youtube.com/watch?v=yeYUirOYiE4 https://twitter.com/ArtificialBob/status/1579404863825653763 ​ Could someone link a tutorial so I could understand the process ? Thanks ! submitted by /u/grosbouff [link] [comments]  ( 51 min )
    Top Trends in A.I. in 2023
    submitted by /u/BackgroundResult [link] [comments]  ( 24 min )
    Ai generared guide to riches
    Sure! Here is a more detailed version of the script adapted for a Reddit post: Hello Redditors! I wanted to share some practical strategies that you can use to get rich. Whether you're just starting out on your wealth-building journey or you're looking for ways to take your financial situation to the next level, these strategies can help. Define your financial goals Before you get started, it's important to define what "getting rich" means to you. This might involve setting a specific financial goal, like saving a certain amount of money or achieving a certain level of income. It's important to have a clear idea of what you want to achieve so that you can stay motivated and focused as you work towards financial success. Create a plan Once you have your goals in mind, it's time to devel…  ( 56 min )
    Sullivan King & Subtronics - Take Flight AI Music Video 4K 60 FPS
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 40 min )
    AI generated Sam Harris meditation
    submitted by /u/justLV [link] [comments]  ( 50 min )
    Innocent Man Arrested In Another Facial Recognition Failure
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 24 min )
    Open Source AI Resources/Recommendations?
    Does anyone in the community have any recommended open source AI projects to get that are still available? I've heard ChatGPT has a few open source community driven competitors? Additionally, does anyone have any recommended open source AI image/video upres programs they use? I really want to try and preserve (and hopefully contribute) to what's out there currently. I know there has been a large movement to keep open source AI projects alive while these private groups gain momentum. Also, this goes out to the mods, but maybe we should have a pinned post with a list of community driven/open source AI that people can download, mess with, and hopefully contribute to? submitted by /u/OldsDiesel [link] [comments]  ( 41 min )
  • Open

    [D] Can common pooled RAM and VRAM increase devices' capabilities with regards to operating on larger models?
    A lot of interesting models that came out recently have been using the transformer architecture, or otherwise required a lot of VRAM, such amounts, in fact, that it has become unattainable to even just run inference for almost everyone, because the amount of necessary VRAM can only be attained by pooling together numerous >1k usd GPUs. But VRAM in itself isn't that crazy expensive, it's the the rest of that super powerful GPU that's attached to them that's expensive. If I understand correctly, some manufacturers, like, say, Apple, have been using common pool of memory for both RAM and VRAM, thereby having GPUs with a crazy amount of VRAM. For example, their least expensive Mac Studio has an amount of VRAM equivalent to an A100. So my question is, does a unified memory pool (common RAM and VRAM) lead to more capable inference? And if so, should/will manufacturers gravitate towards such a hardware solution? And are there any technical obstacles inhibiting such a path? submitted by /u/whyvitamins [link] [comments]  ( 62 min )
    [D] Special-purpose "neuromorphic" chips for AI - current state of the art?
    There are a number of companies out there making special-purpose chip "neuromorphic" architectures that are supposed to be better suited for neural networks. Some of them you can buy for as little as $500. Most of them are designed for Spiking Neural Networks, probably because of the similarity to the human brain. Innatera's chip implements the neural network on an analog computer, which I find very interesting. Is the performance really better than GPUs? Could this achieve the the dream of running a model on as little power as the brain uses? Are spiking neural networks useful for anything? I don't know of any tasks where a SNN is the current state-of-the-art in performance. All the good results right now seem to be coming out of transformers, but maybe that's just because they're so well-suited for the hardware we have available. submitted by /u/currentscurrents [link] [comments]  ( 62 min )
    [D] SOTA Head Pose Estimation Models
    Hey guys! Just wondering what the state of the art is for off the shelf head pose estimation. I’m looking to compute the pitch, yaw, and roll values for the heads in some videos I’m processing. I need something that will work off the shelf that’s a bit more reliable than using landmark based methods. Thank you! submitted by /u/TightestKnees [link] [comments]  ( 60 min )
    [D] Has any research been done on using GANs to develop proteins for selective binding?
    While long ago I hailed from the realm of biochemistry, I'm sitting more on the data side these days and can't help but be nostalgic for the possible use of generative neural nets on the bio side. Having a little curiosity, I found that this work has already began somewhat. For instance, this group managed to get a reasonable design schema by incorporating Attention into their GANs: https://www.nature.com/articles/s42256-021-00310-5 To me then, if we can develop "reasonable" (i.e. soluble among other properties) protein sequences using GANs, a natural extension would be to train them for selective binding. For instance, imagine adding the loss from a pre-trained Discriminator that predicts binding a certain target into the above. On one hand, that seems like a tall order, but given some of the research I've thumbed through, I believe such a discriminator is already shown as tenable. Here's a few papers: A meta review of the topic An implementation using Attention Another implementation Being able to serve up amino acid sequences with selective binding properties seems really really attractive and a natural "next step" for GAN research here. The closest I have seen though is this work which is more of the flavor of redesigning a sequence rather than a de novo approach. Anyways, I'm no researcher in the field and don't have a dog in the fight so to speak, but it's just an exciting thought to me. Curious if it's been done yet? And if not, just interested in sparking some discussion in case others have similar interests. submitted by /u/jshkk [link] [comments]  ( 61 min )
    [Project] Public MIT IAP talks on speech & language ML starting Monday
    https://iap.gridspace.com/ First week is on sound and DSP. Second week on NLP and language. Third on speech ML models. And fourth week on applications. There's a signup google form on the site. submitted by /u/arrowoftime [link] [comments]  ( 58 min )
    Image matching within database? [P]
    A friend and I are working on a project that requires us to take images as input find images that match them from our database. What is the most effective way to do this? We've tried SIFT and a few similar solutions, but nothing's been super effective so far. Does anyone have any suggestions? Are there any solid open-source solutions? submitted by /u/Clarkmilo [link] [comments]  ( 62 min )
    [News] AMD Instinct MI300 APU for AI and HPC announced
    https://www.anandtech.com/show/18721/ces-2023-amd-instinct-mi300-data-center-apu-silicon-in-hand-146b-transistors-shipping-h223 I wonder if this is the beginning of dissolution of NVIDIA's monopoly on AI. submitted by /u/samobon [link] [comments]  ( 60 min )
    using RL for liquidity provision on Uniswap V3 [R]
    Hi, has anyone done any work in using ML or RL to automate providing liquidity in uniswapv3? Rewards are based on volume against the TVL, you earn a percentage of each trade, so high volume in a low liquidity market can earn some very good rewards as long as the price is stable enough that it doesn't effect the APR.. ideally i'd like to assess trends in volume based on TA indicators and then assign a portion of funds to the pool based on some forecasted returns.. i'm a complete newbie but come from a data engineering background, so i'd like to figure out of there's any where i can look into starting from or if the is a completely green field subject for now. submitted by /u/adamnmcc [link] [comments]  ( 24 min )
    [R] Airlift Challenge Competition
    Announcing the Airlift Challenge Competition Website: https://airliftchallenge.com The Airlift Challenge seeks improved algorithms to plan and execute airlift operations in the face of dynamic disruptions. Participants compete by submitting OpenAI Gym-based agents that minimize delivery time and cost using reinforcement learning, optimization, heuristics, or any other technique. Overview Airlifts demand the delivery of large sets of cargo into areas of need under tight deadlines. Yet, there are many obstacles preventing timely delivery. Airports have limited capacity to process airplanes, thus limiting throughput and potentially creating bottlenecks. Weather disruptions can cause delays or force airplanes to re-route. Unexpected cargo may be staged for an urgent delivery. This competi…  ( 63 min )
    Combining pre-trained word embeddings [R]
    Which is the best possible method to combine the embeddings of two words. Example is "comparison" + "contrast" and "expansion" and "restatement". So I want the ML model to understand that contrast is more related to comparison and restatement is more related to expansion. My initial institution is vector addition of the embeddings of these words. I would like to hear you feedbacks on the same. Thanks in advance :-) submitted by /u/BothEntertainment786 [link] [comments]  ( 61 min )
    Is there another way to determine the effect of the features other than the inbuilt features importance and SHAP values? [Research] [Discussion]
    I am working on a classification churn model, I have used feature importance to show the features' importance at model fit. I used shap values to show permutation importance of each feature. But I feel it's harder to represent causation of the prediction to each feature. Any ideas ? submitted by /u/spb-world [link] [comments]  ( 59 min )
    Democratizing Index Tracking: A GNN-based Meta-Learning Method for Sparse Portfolio Optimization [R] [P]
    Have you ever wanted to invest in a US ETF or mutual fund, but found that many of the actively managed index trackers were expensive or out of reach due to regulations? I have recently developed a solution to this problem that allows small investors to create their sparse stock portfolios for tracking an index by proposing a novel population-based large-scale non-convex optimization method via a Deep Generative Model that learns to sample good portfolios. Sparse VGT Tracker - QuantConnect Backtest I've compared this approach to the state-of-the-art evolutionary strategy (Fast CMA-ES) and found that it is more efficient at finding optimal index-tracking portfolios. The PyTorch implementations of both methods and the dataset are available on my GitHub repository for reproducibility and further improvement. Check out the repository to learn more about this new meta-learning approach for evolutionary optimization, or run your small index fund at home! Best Index-Tracking Validation Loss Achieved on Out-of-Sample Period in 100 Epochs submitted by /u/k_yuksel [link] [comments]  ( 64 min )
    [D] Isolation Forest
    I've a isolation forest model in production environment that trains every time we try to find anomalies and it classifies different points as anomalies. What should I do such that only most likely anomaly points are provided as results and results don't differ much? I do not want to set random state. submitted by /u/TKMater [link] [comments]  ( 58 min )
    [P] Learning chatbot for MOS KIM-1 (1976, 6502 CPU, 1K RAM), run on KIM Uno
    Hi everyone! I programmed a learning chatbot for (a clone of) the MOS KIM-1, hailing from 1976 with its 6502 CPU and 1K RAM. Basically, it works this way - you give it a byte, and it answers with a byte; at the same time, it learns from each interaction, which byte "should" answer which, and updates its knowledge base accordingly. It actually runs on a KIM Uno, an Arduino based clone of the KIM-1. This is the GitHub page with the code, contained in two short programs: one (optional) to slighly pre-populate the knowledge base with about a dozen of bytes that would constitute a nucleus of original replies (to be evolved into "your" interactions, as you chat on), starting from $0100, as well as the actual chatbot program to be launched from $0200 (the "user input" byte is to be entered prior to run in $0010, and the reply will be contained after run at $0013, so yes, you are "chatting" in hex), in each case, both in assembler and already assembled (and ready to be entered into the KIM-1): https://github.com/KedalionDaimon/MOS-KIM-1-chatbot And this is my YouTube video, presenting it: https://www.youtube.com/watch?v=7MJgi5kua3M submitted by /u/NinoIvanov [link] [comments]  ( 63 min )
    [D] BERTopic and fuzzy clustering
    Hi everyone, I am having a chunk of textual conversational data which I need to analyze for topics in it. I am currently using BERTopic to do it but the issue with it is that it does hard clustering, i.e., each datapoint belongs to exactly one topic cluster. But many of the sentences are multi intent and do fall in many topics simultaneously. Can soft/fuzzy clustering , where each datapoint can belong to more than one topic clusters, be done via BERTopic? If yes, then how can it be implemented? If not, which other algorithms can be used? submitted by /u/Devinco001 [link] [comments]  ( 58 min )
  • Open

    AI Engineer: Learn About The Role And Skills Needed For Success In 2023
    Computer engineering has come a long way from merely being a curriculum major to a key credential for your AI portfolio. Artificial engineering is a role that commands huge skill and expertise in the realm of technology. Unsurprisingly, the demand for their services outstrips the supply. The industry is oozing with valuable companies with diversified… Read More »AI Engineer: Learn About The Role And Skills Needed For Success In 2023 The post AI Engineer: Learn About The Role And Skills Needed For Success In 2023 appeared first on Data Science Central.  ( 21 min )
    Seven steps to simple DIY market forecasting
    I became a true data hound after a stint as a market analyst in the semiconductor industry. The role involved these activities: The habits and mindset I developed as a data hound, forecaster, and spreadsheet jockey inform the way I track emerging data technology trends to this day. So I thought I’d share the main… Read More »Seven steps to simple DIY market forecasting The post Seven steps to simple DIY market forecasting appeared first on Data Science Central.  ( 21 min )
  • Open

    Can someone help me code (Python) Monte Carlo Tree Search to find the highest value leaf node. Don't need to be spoonfed. I just need a rough outline and someone to ask questions to. Thanks. Ref. Question 2
    submitted by /u/Kamal_Ata_Turk [link] [comments]  ( 55 min )
    Democratizing Index Tracking: A GNN-based Meta-Learning Method for Sparse Portfolio Optimization
    Have you ever wanted to invest in a US ETF or mutual fund, but found that many of the actively managed index trackers were expensive or out of reach due to regulations? I have recently developed a solution to this problem that allows small investors to create their sparse stock portfolios for tracking an index by proposing a novel population-based large-scale non-convex optimization method via a Deep Generative Model that learns to sample good portfolios. Sparse VGT Tracker - QuantConnect Backtest I've compared this approach to the state-of-the-art evolutionary strategy (Fast CMA-ES) and found that it is more efficient at finding optimal index-tracking portfolios. The PyTorch implementations of both methods and the dataset are available on my GitHub repository for reproducibility and further improvement. Check out the repository to learn more about this new meta-learning approach for evolutionary optimization, or run your small index fund at home! Best Index-Tracking Validation Loss Achieved on Out-of-Sample Period in 100 Epochs submitted by /u/k_yuksel [link] [comments]  ( 24 min )
    Reinforcement learning in ChatGPT
    Today, I read the paper about InstructGPT on which ChatGPT is based, and I was surprised to see that it uses reinforcement learning in the training process. It uses PPO to optimize its prompts on a reward signal given by another trained model. Though I found this approach really interesting, I was left wondering how prompts are created by the policy. I have never heard of PPO being used to generate text before, and I am very curious as to what this action-space looks like. Does anyone here have any insight on this? Also, it would be very interesting to hear what people here think RL advances can do for further research in generative models like this! submitted by /u/Embarrassed-Print-13 [link] [comments]  ( 57 min )
    How many steps are models usually trained before deployment?
    Since it depends on sample efficiency (which varies by method), the average method used to train the Model finally deployed is in consideration. View Poll submitted by /u/XecutionStyle [link] [comments]  ( 55 min )
  • Open

    Simulating discrimination in virtual reality
    The role-playing game “On the Plane” simulates xenophobia to foster greater understanding and reflection via virtual experiences.  ( 9 min )
  • Open

    Tipping Point: NVIDIA DRIVE Scales AI-Powered Transportation at CES 2023
    Autonomous vehicle (AV) technology is heading to the mainstream. The NVIDIA DRIVE ecosystem showcased significant milestones toward widespread intelligent transportation at CES. Growth is occurring in vehicle deployment plans as well as AI solutions integrating further into the car. Foxconn joined the NVIDIA DRIVE ecosystem. The world’s largest technology manufacturer will produce electronic control units Read article >  ( 6 min )
    GFN Thursday Brings RTX 4080 to the Cloud With GeForce NOW Ultimate Membership
    GFN Thursday rings in the new year with a recap of the biggest cloud gaming news from CES 2023: the GeForce NOW Ultimate membership. Powered by the latest NVIDIA GPU technology, Ultimate members can play their favorite PC games at performance never before available from the cloud. Plus, with a new year comes new games. Read article >  ( 7 min )
  • Open

    Tracing the Origin of Adversarial Attack for Forensic Investigation and Deterrence. (arXiv:2301.01218v1 [cs.CR])
    Deep neural networks are vulnerable to adversarial attacks. In this paper, we take the role of investigators who want to trace the attack and identify the source, that is, the particular model which the adversarial examples are generated from. Techniques derived would aid forensic investigation of attack incidents and serve as deterrence to potential attacks. We consider the buyers-seller setting where a machine learning model is to be distributed to various buyers and each buyer receives a slightly different copy with same functionality. A malicious buyer generates adversarial examples from a particular copy $\mathcal{M}_i$ and uses them to attack other copies. From these adversarial examples, the investigator wants to identify the source $\mathcal{M}_i$. To address this problem, we propose a two-stage separate-and-trace framework. The model separation stage generates multiple copies of a model for a same classification task. This process injects unique characteristics into each copy so that adversarial examples generated have distinct and traceable features. We give a parallel structure which embeds a ``tracer'' in each copy, and a noise-sensitive training loss to achieve this goal. The tracing stage takes in adversarial examples and a few candidate models, and identifies the likely source. Based on the unique features induced by the noise-sensitive loss function, we could effectively trace the potential adversarial copy by considering the output logits from each tracer. Empirical results show that it is possible to trace the origin of the adversarial example and the mechanism can be applied to a wide range of architectures and datasets.
    StarGraph: Knowledge Representation Learning based on Incomplete Two-hop Subgraph. (arXiv:2205.14209v2 [cs.CL] UPDATED)
    Conventional representation learning algorithms for knowledge graphs (KG) map each entity to a unique embedding vector, ignoring the rich information contained in the neighborhood. We propose a method named StarGraph, which gives a novel way to utilize the neighborhood information for large-scale knowledge graphs to obtain entity representations. An incomplete two-hop neighborhood subgraph for each target node is at first generated, then processed by a modified self-attention network to obtain the entity representation, which is used to replace the entity embedding in conventional methods. We achieved SOTA performance on ogbl-wikikg2 and got competitive results on fb15k-237. The experimental results proves that StarGraph is efficient in parameters, and the improvement made on ogbl-wikikg2 demonstrates its great effectiveness of representation learning on large-scale knowledge graphs. The code is now available at \url{https://github.com/hzli-ucas/StarGraph}.
    Clustering Aware Classification for Risk Prediction and Subtyping in Clinical Data. (arXiv:2102.11872v5 [cs.LG] UPDATED)
    In data containing heterogeneous subpopulations, classification performance benefits from incorporating the knowledge of cluster structure in the classifier. Previous methods for such combined clustering and classification either 1) are classifier-specific and not generic, or 2) independently perform clustering and classifier training, which may not form clusters that can potentially benefit classifier performance. The question of how to perform clustering to improve the performance of classifiers trained on the clusters has received scant attention in previous literature, despite its importance in several real-world applications. In this paper, first, we theoretically analyze the generalization performance of classifiers trained on clustered data and find conditions under which clustering can potentially aid classification. This motivates the design of a simple k-means-based classification algorithm called Clustering Aware Classification (CAC) and its neural variant {DeepCAC}. DeepCAC effectively leverages deep representation learning to learn latent embeddings and finds clusters in a manner that make the clustered data suitable for training classifiers for each underlying subpopulation. Our experiments on synthetic and real benchmark datasets demonstrate the efficacy of DeepCAC over previous methods for combined clustering and classification.
    Efficient Self-Supervision using Patch-based Contrastive Learning for Histopathology Image Segmentation. (arXiv:2208.10779v2 [cs.CV] UPDATED)
    Learning discriminative representations of unlabelled data is a challenging task. Contrastive self-supervised learning provides a framework to learn meaningful representations using learned notions of similarity measures from simple pretext tasks. In this work, we propose a simple and efficient framework for self-supervised image segmentation using contrastive learning on image patches, without using explicit pretext tasks or any further labeled fine-tuning. A fully convolutional neural network (FCNN) is trained in a self-supervised manner to discern features in the input images and obtain confidence maps which capture the network's belief about the objects belonging to the same class. Positive- and negative- patches are sampled based on the average entropy in the confidence maps for contrastive learning. Convergence is assumed when the information separation between the positive patches is small, and the positive-negative pairs is large. The proposed model only consists of a simple FCNN with 10.8k parameters and requires about 5 minutes to converge on the high resolution microscopy datasets, which is orders of magnitude smaller than the relevant self-supervised methods to attain similar performance. We evaluate the proposed method for the task of segmenting nuclei from two histopathology datasets, and show comparable performance with relevant self-supervised and supervised methods.
    Unlearnable Clusters: Towards Label-agnostic Unlearnable Examples. (arXiv:2301.01217v1 [cs.CR])
    There is a growing interest in developing unlearnable examples (UEs) against visual privacy leaks on the Internet. UEs are training samples added with invisible but unlearnable noise, which have been found can prevent unauthorized training of machine learning models. UEs typically are generated via a bilevel optimization framework with a surrogate model to remove (minimize) errors from the original samples, and then applied to protect the data against unknown target models. However, existing UE generation methods all rely on an ideal assumption called label-consistency, where the hackers and protectors are assumed to hold the same label for a given sample. In this work, we propose and promote a more practical label-agnostic setting, where the hackers may exploit the protected data quite differently from the protectors. E.g., a m-class unlearnable dataset held by the protector may be exploited by the hacker as a n-class dataset. Existing UE generation methods are rendered ineffective in this challenging setting. To tackle this challenge, we present a novel technique called Unlearnable Clusters (UCs) to generate label-agnostic unlearnable examples with cluster-wise perturbations. Furthermore, we propose to leverage VisionandLanguage Pre-trained Models (VLPMs) like CLIP as the surrogate model to improve the transferability of the crafted UCs to diverse domains. We empirically verify the effectiveness of our proposed approach under a variety of settings with different datasets, target models, and even commercial platforms Microsoft Azure and Baidu PaddlePaddle.
    Gradient Descent Ascent for Minimax Problems on Riemannian Manifolds. (arXiv:2010.06097v5 [cs.LG] UPDATED)
    In the paper, we study a class of useful minimax problems on Riemanian manifolds and propose a class of effective Riemanian gradient-based methods to solve these minimax problems. Specifically, we propose an effective Riemannian gradient descent ascent (RGDA) algorithm for the deterministic minimax optimization. Moreover, we prove that our RGDA has a sample complexity of $O(\kappa^2\epsilon^{-2})$ for finding an $\epsilon$-stationary solution of the Geodesically-Nonconvex Strongly-Concave (GNSC) minimax problems, where $\kappa$ denotes the condition number. At the same time, we present an effective Riemannian stochastic gradient descent ascent (RSGDA) algorithm for the stochastic minimax optimization, which has a sample complexity of $O(\kappa^4\epsilon^{-4})$ for finding an $\epsilon$-stationary solution. To further reduce the sample complexity, we propose an accelerated Riemannian stochastic gradient descent ascent (Acc-RSGDA) algorithm based on the momentum-based variance-reduced technique. We prove that our Acc-RSGDA algorithm achieves a lower sample complexity of $\tilde{O}(\kappa^{4}\epsilon^{-3})$ in searching for an $\epsilon$-stationary solution of the GNSC minimax problems. Extensive experimental results on the robust distributional optimization and robust Deep Neural Networks (DNNs) training over Stiefel manifold demonstrate efficiency of our algorithms.
    Explaining Imitation Learning through Frames. (arXiv:2301.01088v1 [cs.LG])
    As one of the prevalent methods to achieve automation systems, Imitation Learning (IL) presents a promising performance in a wide range of domains. However, despite the considerable improvement in policy performance, the corresponding research on the explainability of IL models is still limited. Inspired by the recent approaches in explainable artificial intelligence methods, we proposed a model-agnostic explaining framework for IL models called R2RISE. R2RISE aims to explain the overall policy performance with respect to the frames in demonstrations. It iteratively retrains the black-box IL model from the randomized masked demonstrations and uses the conventional evaluation outcome environment returns as the coefficient to build an importance map. We also conducted experiments to investigate three major questions concerning frames' importance equality, the effectiveness of the importance map, and connections between importance maps from different IL models. The result shows that R2RISE successfully distinguishes important frames from the demonstrations.
    SpectroscopyNet: Learning to pre-process Spectroscopy Signals without clean data. (arXiv:2110.13748v2 [cs.LG] UPDATED)
    In this work we propose a deep learning approach to clean spectroscopy signals using only uncleaned data. Cleaning signals from spectroscopy instrument noise is challenging as noise exhibits an unknown, non-zero mean, multivariate distributions. Our framework is a siamese neural net that learns identifiable disentanglement of the signal and noise components under a stationarity assumption. The disentangled representations satisfy reconstruction fidelity, reduce consistencies with measurements of unrelated targets and imposes relaxed-orthogonality constraints between the signal and noise representations. Evaluations on a laser induced breakdown spectroscopy (LIBS) dataset from the ChemCam instrument onboard the Martian Curiosity rover show a superior performance in cleaning LIBS measurements compared to the standard feature engineered approaches being used by the ChemCam team.  ( 2 min )
    RAIDER: Reinforcement-aided Spear Phishing Detector. (arXiv:2105.07582v3 [cs.CR] UPDATED)
    Spear Phishing is a harmful cyber-attack facing business and individuals worldwide. Considerable research has been conducted recently into the use of Machine Learning (ML) techniques to detect spear-phishing emails. ML-based solutions may suffer from zero-day attacks; unseen attacks unaccounted for in the training data. As new attacks emerge, classifiers trained on older data are unable to detect these new varieties of attacks resulting in increasingly inaccurate predictions. Spear Phishing detection also faces scalability challenges due to the growth of the required features which is proportional to the number of the senders within a receiver mailbox. This differs from traditional phishing attacks which typically perform only a binary classification between phishing and benign emails. Therefore, we devise a possible solution to these problems, named RAIDER: Reinforcement AIded Spear Phishing DEtectoR. A reinforcement-learning based feature evaluation system that can automatically find the optimum features for detecting different types of attacks. By leveraging a reward and penalty system, RAIDER allows for autonomous features selection. RAIDER also keeps the number of features to a minimum by selecting only the significant features to represent phishing emails and detect spear-phishing attacks. After extensive evaluation of RAIDER over 11,000 emails and across 3 attack scenarios, our results suggest that using reinforcement learning to automatically identify the significant features could reduce the dimensions of the required features by 55% in comparison to existing ML-based systems. It also improves the accuracy of detecting spoofing attacks by 4% from 90% to 94%. In addition, RAIDER demonstrates reasonable detection accuracy even against a sophisticated attack named Known Sender in which spear-phishing emails greatly resemble those of the impersonated sender.  ( 3 min )
    Pseudo-Inverted Bottleneck Convolution for DARTS Search Space. (arXiv:2301.01286v1 [cs.LG])
    Differentiable Architecture Search (DARTS) has attracted considerable attention as a gradient-based Neural Architecture Search (NAS) method. Since the introduction of DARTS, there has been little work done on adapting the action space based on state-of-art architecture design principles for CNNs. In this work, we aim to address this gap by incrementally augmenting the DARTS search space with micro-design changes inspired by ConvNeXt and studying the trade-off between accuracy, evaluation layer count, and computational cost. To this end, we introduce the Pseudo-Inverted Bottleneck conv block intending to reduce the computational footprint of the inverted bottleneck block proposed in ConvNeXt. Our proposed architecture is much less sensitive to evaluation layer count and outperforms a DARTS network with similar size significantly, at layer counts as small as 2. Furthermore, with less layers, not only does it achieve higher accuracy with lower GMACs and parameter count, GradCAM comparisons show that our network is able to better detect distinctive features of target objects compared to DARTS.  ( 2 min )
    Hypernetworks for Zero-shot Transfer in Reinforcement Learning. (arXiv:2211.15457v2 [cs.LG] UPDATED)
    In this paper, hypernetworks are trained to generate behaviors across a range of unseen task conditions, via a novel TD-based training objective and data from a set of near-optimal RL solutions for training tasks. This work relates to meta RL, contextual RL, and transfer learning, with a particular focus on zero-shot performance at test time, enabled by knowledge of the task parameters (also known as context). Our technical approach is based upon viewing each RL algorithm as a mapping from the MDP specifics to the near-optimal value function and policy and seek to approximate it with a hypernetwork that can generate near-optimal value functions and policies, given the parameters of the MDP. We show that, under certain conditions, this mapping can be considered as a supervised learning problem. We empirically evaluate the effectiveness of our method for zero-shot transfer to new reward and transition dynamics on a series of continuous control tasks from DeepMind Control Suite. Our method demonstrates significant improvements over baselines from multitask and meta RL approaches.
    Temporal Difference Learning with Compressed Updates: Error-Feedback meets Reinforcement Learning. (arXiv:2301.00944v1 [cs.LG])
    In large-scale machine learning, recent works have studied the effects of compressing gradients in stochastic optimization in order to alleviate the communication bottleneck. These works have collectively revealed that stochastic gradient descent (SGD) is robust to structured perturbations such as quantization, sparsification, and delays. Perhaps surprisingly, despite the surge of interest in large-scale, multi-agent reinforcement learning, almost nothing is known about the analogous question: Are common reinforcement learning (RL) algorithms also robust to similar perturbations? In this paper, we investigate this question by studying a variant of the classical temporal difference (TD) learning algorithm with a perturbed update direction, where a general compression operator is used to model the perturbation. Our main technical contribution is to show that compressed TD algorithms, coupled with an error-feedback mechanism used widely in optimization, exhibit the same non-asymptotic theoretical guarantees as their SGD counterparts. We then extend our results significantly to nonlinear stochastic approximation algorithms and multi-agent settings. In particular, we prove that for multi-agent TD learning, one can achieve linear convergence speedups in the number of agents while communicating just $\tilde{O}(1)$ bits per agent at each time step. Our work is the first to provide finite-time results in RL that account for general compression operators and error-feedback in tandem with linear function approximation and Markovian sampling. Our analysis hinges on studying the drift of a novel Lyapunov function that captures the dynamics of a memory variable introduced by error feedback.
    Heterogeneous Domain Adaptation and Equipment Matching: DANN-based Alignment with Cyclic Supervision (DBACS). (arXiv:2301.01038v1 [cs.LG])
    Process monitoring and control are essential in modern industries for ensuring high quality standards and optimizing production performance. These technologies have a long history of application in production and have had numerous positive impacts, but also hold great potential when integrated with Industry 4.0 and advanced machine learning, particularly deep learning, solutions. However, in order to implement these solutions in production and enable widespread adoption, the scalability and transferability of deep learning methods have become a focus of research. While transfer learning has proven successful in many cases, particularly with computer vision and homogenous data inputs, it can be challenging to apply to heterogeneous data. Motivated by the need to transfer and standardize established processes to different, non-identical environments and by the challenge of adapting to heterogeneous data representations, this work introduces the Domain Adaptation Neural Network with Cyclic Supervision (DBACS) approach. DBACS addresses the issue of model generalization through domain adaptation, specifically for heterogeneous data, and enables the transfer and scalability of deep learning-based statistical control methods in a general manner. Additionally, the cyclic interactions between the different parts of the model enable DBACS to not only adapt to the domains, but also match them. To the best of our knowledge, DBACS is the first deep learning approach to combine adaptation and matching for heterogeneous data settings. For comparison, this work also includes subspace alignment and a multi-view learning that deals with heterogeneous representations by mapping data into correlated latent feature spaces. Finally, DBACS with its ability to adapt and match, is applied to a virtual metrology use case for an etching process run on different machine types in semiconductor manufacturing.
    Finding the Most Transferable Tasks for Brain Image Segmentation. (arXiv:2301.00934v1 [eess.IV])
    Although many studies have successfully applied transfer learning to medical image segmentation, very few of them have investigated the selection strategy when multiple source tasks are available for transfer. In this paper, we propose a prior knowledge guided and transferability based framework to select the best source tasks among a collection of brain image segmentation tasks, to improve the transfer learning performance on the given target task. The framework consists of modality analysis, RoI (region of interest) analysis, and transferability estimation, such that the source task selection can be refined step by step. Specifically, we adapt the state-of-the-art analytical transferability estimation metrics to medical image segmentation tasks and further show that their performance can be significantly boosted by filtering candidate source tasks based on modality and RoI characteristics. Our experiments on brain matter, brain tumor, and white matter hyperintensities segmentation datasets reveal that transferring from different tasks under the same modality is often more successful than transferring from the same task under different modalities. Furthermore, within the same modality, transferring from the source task that has stronger RoI shape similarity with the target task can significantly improve the final transfer performance. And such similarity can be captured using the Structural Similarity index in the label space.
    Taming Lagrangian Chaos with Multi-Objective Reinforcement Learning. (arXiv:2212.09612v1 [physics.flu-dyn] CROSS LISTED)
    We consider the problem of two active particles in 2D complex flows with the multi-objective goals of minimizing both the dispersion rate and the energy consumption of the pair. We approach the problem by means of Multi Objective Reinforcement Learning (MORL), combining scalarization techniques together with a Q-learning algorithm, for Lagrangian drifters that have variable swimming velocity. We show that MORL is able to find a set of trade-off solutions forming an optimal Pareto frontier. As a benchmark, we show that a set of heuristic strategies are dominated by the MORL solutions. We consider the situation in which the agents cannot update their control variables continuously, but only after a discrete (decision) time, $\tau$. We show that there is a range of decision times, in between the Lyapunov time and the continuous updating limit, where Reinforcement Learning finds strategies that significantly improve over heuristics. In particular, we discuss how large decision times require enhanced knowledge of the flow, whereas for smaller $\tau$ all a priori heuristic strategies become Pareto optimal.
    Real-World Image Super Resolution via Unsupervised Bi-directional Cycle Domain Transfer Learning based Generative Adversarial Network. (arXiv:2211.10563v2 [cs.CV] UPDATED)
    Deep Convolutional Neural Networks (DCNNs) have exhibited impressive performance on image super-resolution tasks. However, these deep learning-based super-resolution methods perform poorly in real-world super-resolution tasks, where the paired high-resolution and low-resolution images are unavailable and the low-resolution images are degraded by complicated and unknown kernels. To break these limitations, we propose the Unsupervised Bi-directional Cycle Domain Transfer Learning-based Generative Adversarial Network (UBCDTL-GAN), which consists of an Unsupervised Bi-directional Cycle Domain Transfer Network (UBCDTN) and the Semantic Encoder guided Super Resolution Network (SESRN). First, the UBCDTN is able to produce an approximated real-like LR image through transferring the LR image from an artificially degraded domain to the real-world LR image domain. Second, the SESRN has the ability to super-resolve the approximated real-like LR image to a photo-realistic HR image. Extensive experiments on unpaired real-world image benchmark datasets demonstrate that the proposed method achieves superior performance compared to state-of-the-art methods.
    Invariance-Aware Randomized Smoothing Certificates. (arXiv:2211.14207v2 [cs.LG] UPDATED)
    Building models that comply with the invariances inherent to different domains, such as invariance under translation or rotation, is a key aspect of applying machine learning to real world problems like molecular property prediction, medical imaging, protein folding or LiDAR classification. For the first time, we study how the invariances of a model can be leveraged to provably guarantee the robustness of its predictions. We propose a gray-box approach, enhancing the powerful black-box randomized smoothing technique with white-box knowledge about invariances. First, we develop gray-box certificates based on group orbits, which can be applied to arbitrary models with invariance under permutation and Euclidean isometries. Then, we derive provably tight gray-box certificates. We experimentally demonstrate that the provably tight certificates can offer much stronger guarantees, but that in practical scenarios the orbit-based method is a good approximation.
    Self-Supervised Monocular Depth Estimation: Solving the Edge-Fattening Problem. (arXiv:2210.00411v3 [cs.CV] UPDATED)
    Self-supervised monocular depth estimation (MDE) models universally suffer from the notorious edge-fattening issue. Triplet loss, as a widespread metric learning strategy, has largely succeeded in many computer vision applications. In this paper, we redesign the patch-based triplet loss in MDE to alleviate the ubiquitous edge-fattening issue. We show two drawbacks of the raw triplet loss in MDE and demonstrate our problem-driven redesigns. First, we present a min. operator based strategy applied to all negative samples, to prevent well-performing negatives sheltering the error of edge-fattening negatives. Second, we split the anchor-positive distance and anchor-negative distance from within the original triplet, which directly optimizes the positives without any mutual effect with the negatives. Extensive experiments show the combination of these two small redesigns can achieve unprecedented results: Our powerful and versatile triplet loss not only makes our model outperform all previous SoTA by a large margin, but also provides substantial performance boosts to a large number of existing models, while introducing no extra inference computation at all.
    Stochastic Langevin Differential Inclusions with Applications to Machine Learning. (arXiv:2206.11533v2 [math.OC] UPDATED)
    Stochastic differential equations of Langevin-diffusion form have received significant attention, thanks to their foundational role in both Bayesian sampling algorithms and optimization in machine learning. In the latter, they serve as a conceptual model of the stochastic gradient flow in training over-parametrized models. However, the literature typically assumes smoothness of the potential, whose gradient is the drift term. Nevertheless, there are many problems, for which the potential function is not continuously differentiable, and hence the drift is not Lipschitz continuous everywhere. This is exemplified by robust losses and Rectified Linear Units in regression problems. In this paper, we show some foundational results regarding the flow and asymptotic properties of Langevin-type Stochastic Differential Inclusions under assumptions appropriate to the machine-learning settings. In particular, we show strong existence of the solution, as well as asymptotic minimization of the canonical free-energy functional.
    UMIX: Improving Importance Weighting for Subpopulation Shift via Uncertainty-Aware Mixup. (arXiv:2209.08928v3 [cs.LG] UPDATED)
    Subpopulation shift widely exists in many real-world machine learning applications, referring to the training and test distributions containing the same subpopulation groups but varying in subpopulation frequencies. Importance reweighting is a normal way to handle the subpopulation shift issue by imposing constant or adaptive sampling weights on each sample in the training dataset. However, some recent studies have recognized that most of these approaches fail to improve the performance over empirical risk minimization especially when applied to over-parameterized neural networks. In this work, we propose a simple yet practical framework, called uncertainty-aware mixup (UMIX), to mitigate the overfitting issue in over-parameterized models by reweighting the ''mixed'' samples according to the sample uncertainty. The training-trajectories-based uncertainty estimation is equipped in the proposed UMIX for each sample to flexibly characterize the subpopulation distribution. We also provide insightful theoretical analysis to verify that UMIX achieves better generalization bounds over prior works. Further, we conduct extensive empirical studies across a wide range of tasks to validate the effectiveness of our method both qualitatively and quantitatively. Code is available at https://github.com/TencentAILabHealthcare/UMIX.
    Catch Me If You Hear Me: Audio-Visual Navigation in Complex Unmapped Environments with Moving Sounds. (arXiv:2111.14843v4 [cs.SD] UPDATED)
    Audio-visual navigation combines sight and hearing to navigate to a sound-emitting source in an unmapped environment. While recent approaches have demonstrated the benefits of audio input to detect and find the goal, they focus on clean and static sound sources and struggle to generalize to unheard sounds. In this work, we propose the novel dynamic audio-visual navigation benchmark which requires catching a moving sound source in an environment with noisy and distracting sounds, posing a range of new challenges. We introduce a reinforcement learning approach that learns a robust navigation policy for these complex settings. To achieve this, we propose an architecture that fuses audio-visual information in the spatial feature space to learn correlations of geometric information inherent in both local maps and audio signals. We demonstrate that our approach consistently outperforms the current state-of-the-art by a large margin across all tasks of moving sounds, unheard sounds, and noisy environments, on two challenging 3D scanned real-world environments, namely Matterport3D and Replica. The benchmark is available at this http URL  ( 2 min )
    Semantic Encoder Guided Generative Adversarial Face Ultra-Resolution Network. (arXiv:2211.10532v2 [cs.CV] UPDATED)
    Face super-resolution is a domain-specific image super-resolution, which aims to generate High-Resolution (HR) face images from their Low-Resolution (LR) counterparts. In this paper, we propose a novel face super-resolution method, namely Semantic Encoder guided Generative Adversarial Face Ultra-Resolution Network (SEGA-FURN) to ultra-resolve an unaligned tiny LR face image to its HR counterpart with multiple ultra-upscaling factors (e.g., 4x and 8x). The proposed network is composed of a novel semantic encoder that has the ability to capture the embedded semantics to guide adversarial learning and a novel generator that uses a hierarchical architecture named Residual in Internal Dense Block (RIDB). Moreover, we propose a joint discriminator which discriminates both image data and embedded semantics. The joint discriminator learns the joint probability distribution of the image space and latent space. We also use a Relativistic average Least Squares loss (RaLS) as the adversarial loss to alleviate the gradient vanishing problem and enhance the stability of the training procedure. Extensive experiments on large face datasets have proved that the proposed method can achieve superior super-resolution results and significantly outperform other state-of-the-art methods in both qualitative and quantitative comparisons.
    Adversarial Self-Attention for Language Understanding. (arXiv:2206.12608v2 [cs.CL] UPDATED)
    Deep neural models (e.g. Transformer) naturally learn spurious features, which create a ``shortcut'' between the labels and inputs, thus impairing the generalization and robustness. This paper advances self-attention mechanism to its robust variant for Transformer-based pre-trained language models (e.g. BERT). We propose \textit{Adversarial Self-Attention} mechanism (ASA), which adversarially biases the attentions to effectively suppress the model reliance on features (e.g. specific keywords) and encourage its exploration of broader semantics. We conduct comprehensive evaluation across a wide range of tasks for both pre-training and fine-tuning stages. For pre-training, ASA unfolds remarkable performance gain compared to naive training for longer steps. For fine-tuning, ASA-empowered models outweigh naive models by a large margin considering both generalization and robustness.
    Modern graph neural networks do worse than classical greedy algorithms in solving combinatorial optimization problems like maximum independent set. (arXiv:2206.13211v2 [cs.LG] UPDATED)
    The recent work ``Combinatorial Optimization with Physics-Inspired Graph Neural Networks'' [Nat Mach Intell 4 (2022) 367] introduces a physics-inspired unsupervised Graph Neural Network (GNN) to solve combinatorial optimization problems on sparse graphs. To test the performances of these GNNs, the authors of the work show numerical results for two fundamental problems: maximum cut and maximum independent set (MIS). They conclude that "the graph neural network optimizer performs on par or outperforms existing solvers, with the ability to scale beyond the state of the art to problems with millions of variables." In this comment, we show that a simple greedy algorithm, running in almost linear time, can find solutions for the MIS problem of much better quality than the GNN. The greedy algorithm is faster by a factor of $10^4$ with respect to the GNN for problems with a million variables. We do not see any good reason for solving the MIS with these GNN, as well as for using a sledgehammer to crack nuts. In general, many claims of superiority of neural networks in solving combinatorial problems are at risk of being not solid enough, since we lack standard benchmarks based on really hard problems. We propose one of such hard benchmarks, and we hope to see future neural network optimizers tested on these problems before any claim of superiority is made.
    Optimal transport with $f$-divergence regularization and generalized Sinkhorn algorithm. (arXiv:2105.14337v2 [math.OC] UPDATED)
    Entropic regularization provides a generalization of the original optimal transport problem. It introduces a penalty term defined by the Kullback-Leibler divergence, making the problem more tractable via the celebrated Sinkhorn algorithm. Replacing the Kullback-Leibler divergence with a general $f$-divergence leads to a natural generalization. The case of divergences defined by superlinear functions was recently studied by Di Marino and Gerolin. Using convex analysis, we extend the theory developed so far to include all $f$-divergences defined by functions of Legendre type, and prove that under some mild conditions, strong duality holds, optimums in both the primal and dual problems are attained, the generalization of the $c$-transform is well-defined, and we give sufficient conditions for the generalized Sinkhorn algorithm to converge to an optimal solution. We propose a practical algorithm for computing an approximate solution of the optimal transport problem with $f$-divergence regularization via the generalized Sinkhorn algorithm. Finally, we present experimental results on synthetic 2-dimensional data, demonstrating the effects of using different $f$-divergences for regularization, which influences convergence speed, numerical stability and sparsity of the optimal coupling.  ( 2 min )
    Efficient Quantized Sparse Matrix Operations on Tensor Cores. (arXiv:2209.06979v3 [cs.DC] UPDATED)
    The exponentially growing model size drives the continued success of deep learning, but it brings prohibitive computation and memory cost. From the algorithm perspective, model sparsification and quantization have been studied to alleviate the problem. From the architecture perspective, hardware vendors provide Tensor cores for acceleration. However, it is very challenging to gain practical speedups from sparse, low-precision matrix operations on Tensor cores, because of the strict requirements for data layout and lack of support for efficiently manipulating the low-precision integers. We propose Magicube, a high-performance sparse-matrix library for low-precision integers on Tensor cores. Magicube supports SpMM and SDDMM, two major sparse operations in deep learning with mixed precision. Experimental results on an NVIDIA A100 GPU show that Magicube achieves on average 1.44x (up to 2.37x) speedup over the vendor-optimized library for sparse kernels, and 1.43x speedup over the state-of-the-art with a comparable accuracy for end-to-end sparse Transformer inference.
    A Worker-Task Specialization Model for Crowdsourcing: Efficient Inference and Fundamental Limits. (arXiv:2111.12550v2 [cs.HC] UPDATED)
    Crowdsourcing system has emerged as an effective platform for labeling data with relatively low cost by using non-expert workers. Inferring correct labels from multiple noisy answers on data, however, has been a challenging problem, since the quality of the answers varies widely across tasks and workers. Many existing works have assumed that there is a fixed ordering of workers in terms of their skill levels, and focused on estimating worker skills to aggregate the answers from workers with different weights. In practice, however, the worker skill changes widely across tasks, especially when the tasks are heterogeneous. In this paper, we consider a new model, called $d$-type specialization model, in which each task and worker has its own (unknown) type and the reliability of each worker can vary in the type of a given task and that of a worker. We allow that the number $d$ of types can scale in the number of tasks. In this model, we characterize the optimal sample complexity to correctly infer the labels within any given accuracy, and propose label inference algorithms achieving the order-wise optimal limit even when the types of tasks or those of workers are unknown. We conduct experiments both on synthetic and real datasets, and show that our algorithm outperforms the existing algorithms developed based on more strict model assumptions.  ( 2 min )
    A Machine Learning Surrogate Modeling Benchmark for Temperature Field Reconstruction of Heat-Source Systems. (arXiv:2108.08298v5 [cs.LG] UPDATED)
    Temperature field reconstruction of heat source systems (TFR-HSS) with limited monitoring sensors occurred in thermal management plays an important role in real time health detection system of electronic equipment in engineering. However, prior methods with common interpolations usually cannot provide accurate reconstruction performance as required. In addition, there exists no public dataset for widely research of reconstruction methods to further boost the reconstruction performance and engineering applications. To overcome this problem, this work develops a machine learning modelling benchmark for TFR-HSS task. First, the TFR-HSS task is mathematically modelled from real-world engineering problem and four types of numerically modellings have been constructed to transform the problem into discrete mapping forms. Then, this work proposes a set of machine learning modelling methods, including the general machine learning methods and the deep learning methods, to advance the state-of-the-art methods over temperature field reconstruction. More importantly, this work develops a novel benchmark dataset, namely Temperature Field Reconstruction Dataset (TFRD), to evaluate these machine learning modelling methods for the TFR-HSS task. Finally, a performance analysis of typical methods is given on TFRD, which can be served as the baseline results on this benchmark.  ( 2 min )
    Fast and Accurate Graph Learning for Huge Data via Minipatch Ensembles. (arXiv:2110.12067v2 [stat.ML] UPDATED)
    Gaussian graphical models provide a powerful framework for uncovering conditional dependence relationships between sets of nodes; they have found applications in a wide variety of fields including sensor and communication networks, physics, finance, and computational biology. Often, one observes data on the nodes and the task is to learn the graph structure, or perform graphical model selection. While this is a well-studied problem with many popular techniques, there are typically three major practical challenges: i) many existing algorithms become computationally intractable in huge-data settings with tens of thousands of nodes; ii) the need for separate data-driven hyperparameter tuning considerably adds to the computational burden; iii) the statistical accuracy of selected edges often deteriorates as the dimension and/or the complexity of the underlying graph structures increase. We tackle these problems by developing the novel Minipatch Graph (MPGraph) estimator. Our approach breaks up the huge graph learning problem into many smaller problems by creating an ensemble of tiny random subsets of both the observations and the nodes, termed minipatches. We then leverage recent advances that use hard thresholding to solve the latent variable graphical model problem to consistently learn the graph on each minipatch. Our approach is computationally fast, embarrassingly parallelizable, memory efficient, and has integrated stability-based hyperparamter tuning. Additionally, we prove that under weaker assumptions than that of the Graphical Lasso, our MPGraph estimator achieves graph selection consistency. We compare our approach to state-of-the-art computational approaches for Gaussian graphical model selection including the BigQUIC algorithm, and empirically demonstrate that our approach is not only more statistically accurate but also extensively faster for huge graph learning problems.  ( 3 min )
    Independence Testing for Bounded Degree Bayesian Network. (arXiv:2204.08690v2 [cs.DS] UPDATED)
    We study the following independence testing problem: given access to samples from a distribution $P$ over $\{0,1\}^n$, decide whether $P$ is a product distribution or whether it is $\varepsilon$-far in total variation distance from any product distribution. For arbitrary distributions, this problem requires $\exp(n)$ samples. We show in this work that if $P$ has a sparse structure, then in fact only linearly many samples are required. Specifically, if $P$ is Markov with respect to a Bayesian network whose underlying DAG has in-degree bounded by $d$, then $\tilde{\Theta}(2^{d/2}\cdot n/\varepsilon^2)$ samples are necessary and sufficient for independence testing.
    PoisonedEncoder: Poisoning the Unlabeled Pre-training Data in Contrastive Learning. (arXiv:2205.06401v3 [cs.CR] UPDATED)
    Contrastive learning pre-trains an image encoder using a large amount of unlabeled data such that the image encoder can be used as a general-purpose feature extractor for various downstream tasks. In this work, we propose PoisonedEncoder, a data poisoning attack to contrastive learning. In particular, an attacker injects carefully crafted poisoning inputs into the unlabeled pre-training data, such that the downstream classifiers built based on the poisoned encoder for multiple target downstream tasks simultaneously classify attacker-chosen, arbitrary clean inputs as attacker-chosen, arbitrary classes. We formulate our data poisoning attack as a bilevel optimization problem, whose solution is the set of poisoning inputs; and we propose a contrastive-learning-tailored method to approximately solve it. Our evaluation on multiple datasets shows that PoisonedEncoder achieves high attack success rates while maintaining the testing accuracy of the downstream classifiers built upon the poisoned encoder for non-attacker-chosen inputs. We also evaluate five defenses against PoisonedEncoder, including one pre-processing, three in-processing, and one post-processing defenses. Our results show that these defenses can decrease the attack success rate of PoisonedEncoder, but they also sacrifice the utility of the encoder or require a large clean pre-training dataset.
    Adaptive Sampling for Discovery. (arXiv:2205.14829v3 [stat.ML] UPDATED)
    In this paper, we study a sequential decision-making problem, called Adaptive Sampling for Discovery (ASD). Starting with a large unlabeled dataset, algorithms for ASD adaptively label the points with the goal to maximize the sum of responses. This problem has wide applications to real-world discovery problems, for example drug discovery with the help of machine learning models. ASD algorithms face the well-known exploration-exploitation dilemma. The algorithm needs to choose points that yield information to improve model estimates but it also needs to exploit the model. We rigorously formulate the problem and propose a general information-directed sampling (IDS) algorithm. We provide theoretical guarantees for the performance of IDS in linear, graph and low-rank models. The benefits of IDS are shown in both simulation experiments and real-data experiments for discovering chemical reaction conditions.
    Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5). (arXiv:2203.13366v7 [cs.IR] UPDATED)
    For a long time, different recommendation tasks typically require designing task-specific architectures and training objectives. As a result, it is hard to transfer the learned knowledge and representations from one task to another, thus restricting the generalization ability of existing recommendation approaches, e.g., a sequential recommendation model can hardly be applied or transferred to a review generation method. To deal with such issues, considering that language can describe almost anything and language grounding is a powerful medium to represent various problems or tasks, we present a flexible and unified text-to-text paradigm called "Pretrain, Personalized Prompt, and Predict Paradigm" (P5) for recommendation, which unifies various recommendation tasks in a shared framework. In P5, all data such as user-item interactions, user descriptions, item metadata, and user reviews are converted to a common format -- natural language sequences. The rich information from natural language assists P5 to capture deeper semantics for personalization and recommendation. Specifically, P5 learns different tasks with the same language modeling objective during pretraining. Thus, it serves as the foundation model for various downstream recommendation tasks, allows easy integration with other modalities, and enables instruction-based recommendation based on prompts. P5 advances recommender systems from shallow model to deep model to big model, and will revolutionize the technical form of recommender systems towards universal recommendation engine. With adaptive personalized prompt for different users, P5 is able to make predictions in a zero-shot or few-shot manner and largely reduces the necessity for extensive fine-tuning. On several recommendation benchmarks, we conduct experiments to show the effectiveness of P5. We release the source code at https://github.com/jeykigung/P5.
    How and Why to Manipulate Your Own Agent: On the Incentives of Users of Learning Agents. (arXiv:2112.07640v4 [cs.GT] UPDATED)
    The usage of automated learning agents is becoming increasingly prevalent in many online economic applications such as online auctions and automated trading. Motivated by such applications, this paper is dedicated to fundamental modeling and analysis of the strategic situations that the users of automated learning agents are facing. We consider strategic settings where several users engage in a repeated online interaction, assisted by regret-minimizing learning agents that repeatedly play a "game" on their behalf. We propose to view the outcomes of the agents' dynamics as inducing a "meta-game" between the users. Our main focus is on whether users can benefit in this meta-game from "manipulating" their own agents by misreporting their parameters to them. We define a general framework to model and analyze these strategic interactions between users of learning agents for general games and analyze the equilibria induced between the users in three classes of games. We show that, generally, users have incentives to misreport their parameters to their own agents, and that such strategic user behavior can lead to very different outcomes than those anticipated by standard analysis.
    Adaptive Discriminative Regularization for Visual Classification. (arXiv:2203.00833v2 [cs.LG] UPDATED)
    How to improve discriminative feature learning is central in classification. Existing works address this problem by explicitly increasing inter-class separability and intra-class similarity, whether by constructing positive and negative pairs for contrastive learning or posing tighter class separating margins. These methods do not exploit the similarity between different classes as they adhere to i.i.d. assumption in data. In this paper, we embrace the real-world data distribution setting that some classes share semantic overlaps due to their similar appearances or concepts. Regarding this hypothesis, we propose a novel regularization to improve discriminative learning. We first calibrate the estimated highest likelihood of one sample based on its semantically neighboring classes, then encourage the overall likelihood predictions to be deterministic by imposing an adaptive exponential penalty. As the gradient of the proposed method is roughly proportional to the uncertainty of the predicted likelihoods, we name it adaptive discriminative regularization (ADR), trained along with a standard cross entropy loss in classification. Extensive experiments demonstrate that it can yield consistent and non-trivial performance improvements in a variety of visual classification tasks (over 10 benchmarks). Furthermore, we find it is robust to long-tailed and noisy label data distribution. Its flexible design enables its compatibility with mainstream classification architectures and losses.
    TurbuGAN: An Adversarial Learning Approach to Spatially-Varying Multiframe Blind Deconvolution with Applications to Imaging Through Turbulence. (arXiv:2203.06764v3 [cs.CV] UPDATED)
    We present a self-supervised and self-calibrating multi-shot approach to imaging through atmospheric turbulence, called TurbuGAN. Our approach requires no paired training data, adapts itself to the distribution of the turbulence, leverages domain-specific data priors, and can generalize from tens to thousands of measurements. We achieve such functionality through an adversarial sensing framework adapted from CryoGAN, which uses a discriminator network to match the distributions of captured and simulated measurements. Our framework builds on CryoGAN by (1) generalizing the forward measurement model to incorporate physically accurate and computationally efficient models for light propagation through anisoplanatic turbulence, (2) enabling adaptation to slightly misspecified forward models, and (3) leveraging domain-specific prior knowledge using pretrained generative networks, when available. We validate TurbuGAN on both computationally simulated and experimentally captured images distorted with anisoplanatic turbulence.
    Probabilistic Approach for Road-Users Detection. (arXiv:2112.01360v2 [cs.CV] UPDATED)
    Object detection in autonomous driving applications implies that the detection and tracking of semantic objects are commonly native to urban driving environments, as pedestrians and vehicles. One of the major challenges in state-of-the-art deep-learning based object detection is false positive which occurrences with overconfident scores. This is highly undesirable in autonomous driving and other critical robotic-perception domains because of safety concerns. This paper proposes an approach to alleviate the problem of overconfident predictions by introducing a novel probabilistic layer to deep object detection networks in testing. The suggested approach avoids the traditional Sigmoid or Softmax prediction layer which often produces overconfident predictions. It is demonstrated that the proposed technique reduces overconfidence in the false positives without degrading the performance on the true positives. The approach is validated on the 2D-KITTI objection detection through the YOLOV4 and SECOND (Lidar-based detector). The proposed approach enables enabling interpretable probabilistic predictions without the requirement of re-training the network and therefore is very practical.
    Off-Policy Evaluation in Embedded Spaces. (arXiv:2203.02807v2 [cs.LG] UPDATED)
    Off-policy evaluation methods are important in recommendation systems and search engines, where data collected under an existing logging policy is used to estimate the performance of a new proposed policy. A common approach to this problem is weighting, where data is weighted by a density ratio between the probability of actions given contexts in the target and logged policies. In practice, two issues often arise. First, many problems have very large action spaces and we may not observe rewards for most actions, and so in finite samples we may encounter a positivity violation. Second, many recommendation systems are not probabilistic and so having access to logging and target policy densities may not be feasible. To address these issues, we introduce the featurized embedded permutation weighting estimator. The estimator computes the density ratio in an action embedding space, which reduces the possibility of positivity violations. The density ratio is computed leveraging recent advances in normalizing flows and density ratio estimation as a classification problem, in order to obtain estimates which are feasible in practice.
    Learning under Storage and Privacy Constraints. (arXiv:2202.02892v2 [cs.IT] UPDATED)
    Storage-efficient privacy-preserving learning is crucial due to the increasing amounts of sensitive user data required for modern learning tasks. We propose a framework for reducing the storage cost of user data while at the same time providing privacy guarantees, without essential loss in the utility of the data for learning. Our method comprises noise injection followed by lossy compression. We show that, when appropriately matching the lossy compression to the distribution of the added noise, the compressed examples converge, in distribution, to that of the noise-free training data as the sample size of the training data (or the dimension of the training data) increases. In this sense, the utility of the data for learning is essentially maintained, while reducing storage and privacy leakage by quantifiable amounts. We present experimental results on the CelebA dataset for gender classification and find that our suggested pipeline delivers in practice on the promise of the theory: the individuals in the images are unrecognizable (or less recognizable, depending on the noise level), overall storage of the data is substantially reduced, with no essential loss (and in some cases a slight boost) to the classification accuracy. As an added bonus, our experiments suggest that our method yields a substantial boost to robustness in the face of adversarial test data.
    A Survey on the Robustness of Feature Importance and Counterfactual Explanations. (arXiv:2111.00358v2 [cs.LG] UPDATED)
    There exist several methods that aim to address the crucial task of understanding the behaviour of AI/ML models. Arguably, the most popular among them are local explanations that focus on investigating model behaviour for individual instances. Several methods have been proposed for local analysis, but relatively lesser effort has gone into understanding if the explanations are robust and accurately reflect the behaviour of underlying models. In this work, we present a survey of the works that analysed the robustness of two classes of local explanations (feature importance and counterfactual explanations) that are popularly used in analysing AI/ML models in finance. The survey aims to unify existing definitions of robustness, introduces a taxonomy to classify different robustness approaches, and discusses some interesting results. Finally, the survey introduces some pointers about extending current robustness analysis approaches so as to identify reliable explainability methods.
    Transferable Energy Storage Bidder. (arXiv:2301.01233v1 [cs.LG])
    Energy storage resources must consider both price uncertainties and their physical operating characteristics when participating in wholesale electricity markets. This is a challenging problem as electricity prices are highly volatile, and energy storage has efficiency losses, power, and energy constraints. This paper presents a novel, versatile, and transferable approach combining model-based optimization with a convolutional long short-term memory network for energy storage to respond to or bid into wholesale electricity markets. We apply transfer learning to the ConvLSTM network to quickly adapt the trained bidding model to new market environments. We test our proposed approach using historical prices from New York State, showing it achieves state-of-the-art results, achieving between 70% to near 90% profit ratio compared to perfect foresight cases, in both price response and wholesale market bidding setting with various energy storage durations. We also test a transfer learning approach by pre-training the bidding model using New York data and applying it to arbitrage in Queensland, Australia. The result shows transfer learning achieves exceptional arbitrage profitability with as little as three days of local training data, demonstrating its significant advantage over training from scratch in scenarios with very limited data availability.
    DMOps: Data Management Operation and Recipes. (arXiv:2301.01228v1 [cs.DB])
    Data-centric AI has shed light on the significance of data within the machine learning (ML) pipeline. Acknowledging its importance, various research and policies are suggested by academia, industry, and government departments. Although the capability of utilizing existing data is essential, the capability to build a dataset has become more important than ever. In consideration of this trend, we propose a "Data Management Operation and Recipes" that will guide the industry regardless of the task or domain. In other words, this paper presents the concept of DMOps derived from real-world experience. By offering a baseline for building data, we want to help the industry streamline its data operation optimally.
    Decentralized cooperative perception for autonomous vehicles: Learning to value the unknown. (arXiv:2301.01250v1 [cs.LG])
    Recently, we have been witnesses of accidents involving autonomous vehicles and their lack of sufficient information. One way to tackle this issue is to benefit from the perception of different view points, namely cooperative perception. We propose here a decentralized collaboration, i.e. peer-to-peer, in which the agents are active in their quest for full perception by asking for specific areas in their surroundings on which they would like to know more. Ultimately, we want to optimize a trade-off between the maximization of knowledge about moving objects and the minimization of the total volume of information received from others, to limit communication costs and message processing time. For this, we propose a way to learn a communication policy that reverses the usual communication paradigm by only requesting from other vehicles what is unknown to the ego-vehicle, instead of filtering on the sender side. We tested three different generative models to be taken as base for a Deep Reinforcement Learning (DRL) algorithm, and compared them to a broadcasting policy and a policy randomly selecting areas. In particular, we propose Locally Predictable VAE (LP-VAE), which appears to be producing better belief states for predictions than state-of-the-art models, both as a standalone model and in the context of DRL. Experiments were conducted in the driving simulator CARLA. Our best models reached on average a gain of 25% of the total complementary information, while only requesting about 5% of the ego-vehicle's perceptual field. This trade-off is adjustable through the interpretable hyperparameters of our reward function.
    Activity Detection for Grant-Free NOMA in Massive IoT Networks. (arXiv:2301.01274v1 [eess.SP])
    Recently, grant-free transmission paradigm has been introduced for massive Internet of Things (IoT) networks to save both time and bandwidth and transmit the message with low latency. In order to accurately decode the message of each device at the base station (BS), first, the active devices at each transmission frame must be identified. In this work, first we investigate the problem of activity detection as a threshold comparing problem. We show the convexity of the activity detection method through analyzing its probability of error which makes it possible to find the optimal threshold for minimizing the activity detection error. Consequently, to achieve an optimum solution, we propose a deep learning (DL)-based method called convolutional neural network (CNN)-activity detection (AD). In order to make it more practical, we consider unknown and time-varying activity rate for the IoT devices. Our simulations verify that our proposed CNN-AD method can achieve higher performance compared to the existing non-Bayesian greedy-based methods. This is while existing methods need to know the activity rate of IoT devices, while our method works for unknown and even time-varying activity rates
    Linear chain conditional random fields, hidden Markov models, and related classifiers. (arXiv:2301.01293v1 [stat.ML])
    Practitioners use Hidden Markov Models (HMMs) in different problems for about sixty years. Besides, Conditional Random Fields (CRFs) are an alternative to HMMs and appear in the literature as different and somewhat concurrent models. We propose two contributions. First, we show that basic Linear-Chain CRFs (LC-CRFs), considered as different from the HMMs, are in fact equivalent to them in the sense that for each LC-CRF there exists a HMM - that we specify - whom posterior distribution is identical to the given LC-CRF. Second, we show that it is possible to reformulate the generative Bayesian classifiers Maximum Posterior Mode (MPM) and Maximum a Posteriori (MAP) used in HMMs, as discriminative ones. The last point is of importance in many fields, especially in Natural Language Processing (NLP), as it shows that in some situations dropping HMMs in favor of CRFs was not necessary.
    A Tutorial on Parametric Variational Inference. (arXiv:2301.01236v1 [stat.ML])
    Variational inference uses optimization, rather than integration, to approximate the marginal likelihood, and thereby the posterior, in a Bayesian model. Thanks to advances in computational scalability made in the last decade, variational inference is now the preferred choice for many high-dimensional models and large datasets. This tutorial introduces variational inference from the parametric perspective that dominates these recent developments, in contrast to the mean-field perspective commonly found in other introductory texts.
    ExploreADV: Towards exploratory attack for Neural Networks. (arXiv:2301.01223v1 [cs.CR])
    Although deep learning has made remarkable progress in processing various types of data such as images, text and speech, they are known to be susceptible to adversarial perturbations: perturbations specifically designed and added to the input to make the target model produce erroneous output. Most of the existing studies on generating adversarial perturbations attempt to perturb the entire input indiscriminately. In this paper, we propose ExploreADV, a general and flexible adversarial attack system that is capable of modeling regional and imperceptible attacks, allowing users to explore various kinds of adversarial examples as needed. We adapt and combine two existing boundary attack methods, DeepFool and Brendel\&Bethge Attack, and propose a mask-constrained adversarial attack system, which generates minimal adversarial perturbations under the pixel-level constraints, namely ``mask-constraints''. We study different ways of generating such mask-constraints considering the variance and importance of the input features, and show that our adversarial attack system offers users good flexibility to focus on sub-regions of inputs, explore imperceptible perturbations and understand the vulnerability of pixels/regions to adversarial attacks. We demonstrate our system to be effective based on extensive experiments and user study.
    Assessment of creditworthiness models privacy-preserving training with synthetic data. (arXiv:2301.01212v1 [q-fin.RM])
    Credit scoring models are the primary instrument used by financial institutions to manage credit risk. The scarcity of research on behavioral scoring is due to the difficult data access. Financial institutions have to maintain the privacy and security of borrowers' information refrain them from collaborating in research initiatives. In this work, we present a methodology that allows us to evaluate the performance of models trained with synthetic data when they are applied to real-world data. Our results show that synthetic data quality is increasingly poor when the number of attributes increases. However, creditworthiness assessment models trained with synthetic data show a reduction of 3\% of AUC and 6\% of KS when compared with models trained with real data. These results have a significant impact since they encourage credit risk investigation from synthetic data, making it possible to maintain borrowers' privacy and to address problems that until now have been hampered by the availability of information.
    Machine Learning Approach to Polymerization Reaction Engineering: Determining Monomers Reactivity Ratios. (arXiv:2301.01231v1 [cs.LG])
    Here, we demonstrate how machine learning enables the prediction of comonomers reactivity ratios based on the molecular structure of monomers. We combined multi-task learning, multi-inputs, and Graph Attention Network to build a model capable of predicting reactivity ratios based on the monomers chemical structures.
    Common Practices and Taxonomy in Deep Multi-view Fusion for Remote Sensing Applications. (arXiv:2301.01200v1 [cs.CV])
    The advances in remote sensing technologies have boosted applications for Earth observation. These technologies provide multiple observations or views with different levels of information. They might contain static or temporary views with different levels of resolution, in addition to having different types and amounts of noise due to sensor calibration or deterioration. A great variety of deep learning models have been applied to fuse the information from these multiple views, known as deep multi-view or multi-modal fusion learning. However, the approaches in the literature vary greatly since different terminology is used to refer to similar concepts or different illustrations are given to similar techniques. This article gathers works on multi-view fusion for Earth observation by focusing on the common practices and approaches used in the literature. We summarize and structure insights from several different publications concentrating on unifying points and ideas. In this manuscript, we provide a harmonized terminology while at the same time mentioning the various alternative terms that are used in literature. The topics covered by the works reviewed focus on supervised learning with the use of neural network models. We hope this review, with a long list of recent references, can support future research and lead to a unified advance in the area.
    A Multi-Source Information Learning Framework for Airbnb Price Prediction. (arXiv:2301.01222v1 [cs.LG])
    With the development of technology and sharing economy, Airbnb as a famous short-term rental platform, has become the first choice for many young people to select. The issue of Airbnb's pricing has always been a problem worth studying. While the previous studies achieve promising results, there are exists deficiencies to solve. Such as, (1) the feature attributes of rental are not rich enough; (2) the research on rental text information is not deep enough; (3) there are few studies on predicting the rental price combined with the point of interest(POI) around the house. To address the above challenges, we proposes a multi-source information embedding(MSIE) model to predict the rental price of Airbnb. Specifically, we first selects the statistical feature to embed the original rental data. Secondly, we generates the word feature vector and emotional score combination of three different text information to form the text feature embedding. Thirdly, we uses the points of interest(POI) around the rental house information generates a variety of spatial network graphs, and learns the embedding of the network to obtain the spatial feature embedding. Finally, this paper combines the three modules into multi source rental representations, and uses the constructed fully connected neural network to predict the price. The analysis of the experimental results shows the effectiveness of our proposed model.
    Task-Guided IRL in POMDPs that Scales. (arXiv:2301.01219v1 [cs.LG])
    In inverse reinforcement learning (IRL), a learning agent infers a reward function encoding the underlying task using demonstrations from experts. However, many existing IRL techniques make the often unrealistic assumption that the agent has access to full information about the environment. We remove this assumption by developing an algorithm for IRL in partially observable Markov decision processes (POMDPs). We address two limitations of existing IRL techniques. First, they require an excessive amount of data due to the information asymmetry between the expert and the learner. Second, most of these IRL techniques require solving the computationally intractable forward problem -- computing an optimal policy given a reward function -- in POMDPs. The developed algorithm reduces the information asymmetry while increasing the data efficiency by incorporating task specifications expressed in temporal logic into IRL. Such specifications may be interpreted as side information available to the learner a priori in addition to the demonstrations. Further, the algorithm avoids a common source of algorithmic complexity by building on causal entropy as the measure of the likelihood of the demonstrations as opposed to entropy. Nevertheless, the resulting problem is nonconvex due to the so-called forward problem. We solve the intrinsic nonconvexity of the forward problem in a scalable manner through a sequential linear programming scheme that guarantees to converge to a locally optimal policy. In a series of examples, including experiments in a high-fidelity Unity simulator, we demonstrate that even with a limited amount of data and POMDPs with tens of thousands of states, our algorithm learns reward functions and policies that satisfy the task while inducing similar behavior to the expert by leveraging the provided side information.
    Offline Reinforcement Learning with Differential Privacy. (arXiv:2206.00810v2 [cs.LG] UPDATED)
    The offline reinforcement learning (RL) problem is often motivated by the need to learn data-driven decision policies in financial, legal and healthcare applications. However, the learned policy could retain sensitive information of individuals in the training data (e.g., treatment and outcome of patients), thus susceptible to various privacy risks. We design offline RL algorithms with differential privacy guarantees which provably prevent such risks. These algorithms also enjoy strong instance-dependent learning bounds under both tabular and linear Markov decision process (MDP) settings. Our theory and simulation suggest that the privacy guarantee comes at (almost) no drop in utility comparing to the non-private counterpart for a medium-size dataset.
    TC-GNN: Accelerating Sparse Graph Neural Network Computation Via Dense Tensor Core on GPUs. (arXiv:2112.02052v2 [cs.LG] UPDATED)
    Recently, graph neural networks (GNNs), as the backbone of graph-based machine learning, demonstrate great success in various domains (e.g., e-commerce). However, the performance of GNNs is usually unsatisfactory due to the highly sparse and irregular graph-based operations. To this end, we propose, TC-GNN, the first GPU Tensor Core Unit (TCU) based GNN acceleration framework. The core idea is to reconcile the "Sparse" GNN computation with "Dense" TCU. Specifically, we conduct an in-depth analysis of the sparse operations in mainstream GNN computing frameworks. We introduce a novel sparse graph translation technique to facilitate TCU processing of sparse GNN workload. We also implement an effective CUDA core and TCU collaboration design to fully utilize GPU resources. We fully integrate TC-GNN with the Pytorch framework for ease of programming. Rigorous experiments show an average of 1.70X speedup over the state-of-the-art Deep Graph Library framework across various GNN models and dataset settings.
    An Empirical Investigation into the Use of Image Captioning for Automated Software Documentation. (arXiv:2301.01224v1 [cs.SE])
    Existing automated techniques for software documentation typically attempt to reason between two main sources of information: code and natural language. However, this reasoning process is often complicated by the lexical gap between more abstract natural language and more structured programming languages. One potential bridge for this gap is the Graphical User Interface (GUI), as GUIs inherently encode salient information about underlying program functionality into rich, pixel-based data representations. This paper offers one of the first comprehensive empirical investigations into the connection between GUIs and functional, natural language descriptions of software. First, we collect, analyze, and open source a large dataset of functional GUI descriptions consisting of 45,998 descriptions for 10,204 screenshots from popular Android applications. The descriptions were obtained from human labelers and underwent several quality control mechanisms. To gain insight into the representational potential of GUIs, we investigate the ability of four Neural Image Captioning models to predict natural language descriptions of varying granularity when provided a screenshot as input. We evaluate these models quantitatively, using common machine translation metrics, and qualitatively through a large-scale user study. Finally, we offer learned lessons and a discussion of the potential shown by multimodal models to enhance future techniques for automated software documentation.
    Comparison of machine learning algorithms for merging gridded satellite and earth-observed precipitation data. (arXiv:2301.01252v1 [physics.ao-ph])
    Gridded satellite precipitation datasets are useful in hydrological applications as they cover large regions with high density. However, they are not accurate in the sense that they do not agree with ground-based measurements. An established means for improving their accuracy is to correct them by adopting machine learning algorithms. The problem is defined as a regression setting, in which the ground-based measurements have the role of the dependent variable and the satellite data are the predictor variables, together with topography factors (e.g., elevation). Most studies of this kind involve a limited number of machine learning algorithms, and are conducted at a small region and for a limited time period. Thus, the results obtained through them are of local importance and do not provide more general guidance and best practices. To provide results that are generalizable and to contribute to the delivery of best practices, we here compare eight state-of-the-art machine learning algorithms in correcting satellite precipitation data for the entire contiguous United States and for a 15-year period. We use monthly data from the PERSIANN (Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks) gridded dataset, together with monthly earth-observed precipitation data from the Global Historical Climatology Network monthly database, version 2 (GHCNm). The results suggest that extreme gradient boosting (XGBoost) and random forests are the most accurate in terms of the squared error scoring function. The remaining algorithms can be ordered as follows from the best to the worst ones: Bayesian regularized feed-forward neural networks, multivariate adaptive polynomial splines (poly-MARS), gradient boosting machines (gbm), multivariate adaptive regression splines (MARS), feed-forward neural networks and linear regression.
    On the Sample Complexity and Metastability of Heavy-tailed Policy Search in Continuous Control. (arXiv:2106.08414v2 [cs.LG] UPDATED)
    Reinforcement learning is a framework for interactive decision-making with incentives sequentially revealed across time without a system dynamics model. Due to its scaling to continuous spaces, we focus on policy search where one iteratively improves a parameterized policy with stochastic policy gradient (PG) updates. In tabular Markov Decision Problems (MDPs), under persistent exploration and suitable parameterization, global optimality may be obtained. By contrast, in continuous space, the non-convexity poses a pathological challenge as evidenced by existing convergence results being mostly limited to stationarity or arbitrary local extrema. To close this gap, we step towards persistent exploration in continuous space through policy parameterizations defined by distributions of heavier tails defined by tail-index parameter alpha, which increases the likelihood of jumping in state space. Doing so invalidates smoothness conditions of the score function common to PG. Thus, we establish how the convergence rate to stationarity depends on the policy's tail index alpha, a Holder continuity parameter, integrability conditions, and an exploration tolerance parameter introduced here for the first time. Further, we characterize the dependence of the set of local maxima on the tail index through an exit and transition time analysis of a suitably defined Markov chain, identifying that policies associated with Levy Processes of a heavier tail converge to wider peaks. This phenomenon yields improved stability to perturbations in supervised learning, which we corroborate also manifests in improved performance of policy search, especially when myopic and farsighted incentives are misaligned.
    Comparison of tree-based ensemble algorithms for merging satellite and earth-observed precipitation data at the daily time scale. (arXiv:2301.01214v1 [cs.LG])
    Merging satellite products and ground-based measurements is often required for obtaining precipitation datasets that simultaneously cover large regions with high density and are more accurate than pure satellite precipitation products. Machine and statistical learning regression algorithms are regularly utilized in this endeavour. At the same time, tree-based ensemble algorithms for regression are adopted in various fields for solving algorithmic problems with high accuracy and low computational cost. The latter can constitute a crucial factor for selecting algorithms for satellite precipitation product correction at the daily and finer time scales, where the size of the datasets is particularly large. Still, information on which tree-based ensemble algorithm to select in such a case for the contiguous United States (US) is missing from the literature. In this work, we conduct an extensive comparison between three tree-based ensemble algorithms, specifically random forests, gradient boosting machines (gbm) and extreme gradient boosting (XGBoost), in the context of interest. We use daily data from the PERSIANN (Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks) and the IMERG (Integrated Multi-satellitE Retrievals for GPM) gridded datasets. We also use earth-observed precipitation data from the Global Historical Climatology Network daily (GHCNd) database. The experiments refer to the entire contiguous US and additionally include the application of the linear regression algorithm for benchmarking purposes. The results suggest that XGBoost is the best-performing tree-based ensemble algorithm among those compared. They also suggest that IMERG is more useful than PERSIANN in the context investigated.
    Estimating Categorical Counterfactuals via Deep Twin Networks. (arXiv:2109.01904v5 [cs.LG] UPDATED)
    Counterfactual inference is a powerful tool, capable of solving challenging problems in high-profile sectors. To perform counterfactual inference, one requires knowledge of the underlying causal mechanisms. However, causal mechanisms cannot be uniquely determined from observations and interventions alone. This raises the question of how to choose the causal mechanisms so that resulting counterfactual inference is trustworthy in a given domain. This question has been addressed in causal models with binary variables, but the case of categorical variables remains unanswered. We address this challenge by introducing for causal models with categorical variables the notion of counterfactual ordering, a principle that posits desirable properties causal mechanisms should posses, and prove that it is equivalent to specific functional constraints on the causal mechanisms. To learn causal mechanisms satisfying these constraints, and perform counterfactual inference with them, we introduce deep twin networks. These are deep neural networks that, when trained, are capable of twin network counterfactual inference -- an alternative to the abduction, action, & prediction method. We empirically test our approach on diverse real-world and semi-synthetic data from medicine, epidemiology, and finance, reporting accurate estimation of counterfactual probabilities while demonstrating the issues that arise with counterfactual reasoning when counterfactual ordering is not enforced.
    Deep Learning for bias-correcting comprehensive high-resolution Earth system models. (arXiv:2301.01253v1 [physics.ao-ph])
    The accurate representation of precipitation in Earth system models (ESMs) is crucial for reliable projections of the ecological and socioeconomic impacts in response to anthropogenic global warming. The complex cross-scale interactions of processes that produce precipitation are challenging to model, however, inducing potentially strong biases in ESM fields, especially regarding extremes. State-of-the-art bias correction methods only address errors in the simulated frequency distributions locally, at every individual grid cell. Improving unrealistic spatial patterns of the ESM output, which would require spatial context, has not been possible so far. Here, we show that a post-processing method based on physically constrained generative adversarial networks (GANs) can correct biases of a state-of-the-art, CMIP6-class ESM both in local frequency distributions and in the spatial patterns at once. While our method improves local frequency distributions equally well as gold-standard bias-adjustment frameworks it strongly outperforms any existing methods in the correction of spatial patterns, especially in terms of the characteristic spatial intermittency of precipitation extremes.
    Understanding Imbalanced Semantic Segmentation Through Neural Collapse. (arXiv:2301.01100v1 [cs.CV])
    A recent study has shown a phenomenon called neural collapse in that the within-class means of features and the classifier weight vectors converge to the vertices of a simplex equiangular tight frame at the terminal phase of training for classification. In this paper, we explore the corresponding structures of the last-layer feature centers and classifiers in semantic segmentation. Based on our empirical and theoretical analysis, we point out that semantic segmentation naturally brings contextual correlation and imbalanced distribution among classes, which breaks the equiangular and maximally separated structure of neural collapse for both feature centers and classifiers. However, such a symmetric structure is beneficial to discrimination for the minor classes. To preserve these advantages, we introduce a regularizer on feature centers to encourage the network to learn features closer to the appealing structure in imbalanced semantic segmentation. Experimental results show that our method can bring significant improvements on both 2D and 3D semantic segmentation benchmarks. Moreover, our method ranks 1st and sets a new record (+6.8% mIoU) on the ScanNet200 test leaderboard. Code will be available at https://github.com/dvlab-research/Imbalanced-Learning.
    DGNet: Distribution Guided Efficient Learning for Oil Spill Image Segmentation. (arXiv:2301.01202v1 [cs.CV])
    Successful implementation of oil spill segmentation in Synthetic Aperture Radar (SAR) images is vital for marine environmental protection. In this paper, we develop an effective segmentation framework named DGNet, which performs oil spill segmentation by incorporating the intrinsic distribution of backscatter values in SAR images. Specifically, our proposed segmentation network is constructed with two deep neural modules running in an interactive manner, where one is the inference module to achieve latent feature variable inference from SAR images, and the other is the generative module to produce oil spill segmentation maps by drawing the latent feature variables as inputs. Thus, to yield accurate segmentation, we take into account the intrinsic distribution of backscatter values in SAR images and embed it in our segmentation model. The intrinsic distribution originates from SAR imagery, describing the physical characteristics of oil spills. In the training process, the formulated intrinsic distribution guides efficient learning of optimal latent feature variable inference for oil spill segmentation. The efficient learning enables the training of our proposed DGNet with a small amount of image data. This is economically beneficial to oil spill segmentation where the availability of oil spill SAR image data is limited in practice. Additionally, benefiting from optimal latent feature variable inference, our proposed DGNet performs accurate oil spill segmentation. We evaluate the segmentation performance of our proposed DGNet with different metrics, and experimental evaluations demonstrate its effective segmentations.  ( 2 min )
    Mutual Information Regularization for Vertical Federated Learning. (arXiv:2301.01142v1 [cs.LG])
    Vertical Federated Learning (VFL) is widely utilized in real-world applications to enable collaborative learning while protecting data privacy and safety. However, previous works show that parties without labels (passive parties) in VFL can infer the sensitive label information owned by the party with labels (active party) or execute backdoor attacks to VFL. Meanwhile, active party can also infer sensitive feature information from passive party. All these pose new privacy and security challenges to VFL systems. We propose a new general defense method which limits the mutual information between private raw data, including both features and labels, and intermediate outputs to achieve a better trade-off between model utility and privacy. We term this defense Mutual Information Regularization Defense (MID). We theoretically and experimentally testify the effectiveness of our MID method in defending existing attacks in VFL, including label inference attacks, backdoor attacks and feature reconstruction attacks.  ( 2 min )
    Backdoor Attacks Against Dataset Distillation. (arXiv:2301.01197v1 [cs.CR])
    Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.  ( 2 min )
    A Survey On Few-shot Knowledge Graph Completion with Structural and Commonsense Knowledge. (arXiv:2301.01172v1 [cs.CL])
    Knowledge graphs (KG) have served as the key component of various natural language processing applications. Commonsense knowledge graphs (CKG) are a special type of KG, where entities and relations are composed of free-form text. However, previous works in KG completion and CKG completion suffer from long-tail relations and newly-added relations which do not have many know triples for training. In light of this, few-shot KG completion (FKGC), which requires the strengths of graph representation learning and few-shot learning, has been proposed to challenge the problem of limited annotated data. In this paper, we comprehensively survey previous attempts on such tasks in the form of a series of methods and applications. Specifically, we first introduce FKGC challenges, commonly used KGs, and CKGs. Then we systematically categorize and summarize existing works in terms of the type of KGs and the methods. Finally, we present applications of FKGC models on prediction tasks in different areas and share our thoughts on future research directions of FKGC.  ( 2 min )
    Speed up the inference of diffusion models via shortcut MCMC sampling. (arXiv:2301.01206v1 [cs.CV])
    Diffusion probabilistic models have generated high quality image synthesis recently. However, one pain point is the notorious inference to gradually obtain clear images with thousands of steps, which is time consuming compared to other generative models. In this paper, we present a shortcut MCMC sampling algorithm, which balances training and inference, while keeping the generated data's quality. In particular, we add the global fidelity constraint with shortcut MCMC sampling to combat the local fitting from diffusion models. We do some initial experiments and show very promising results. Our implementation is available at https://github.com//vividitytech/diffusion-mcmc.git.  ( 2 min )
    Conservation Tools: The Next Generation of Engineering--Biology Collaborations. (arXiv:2301.01103v1 [q-bio.QM])
    The recent increase in public and academic interest in preserving biodiversity has led to the growth of the field of conservation technology. This field involves designing and constructing tools that utilize technology to aid in the conservation of wildlife. In this article, we will use case studies to demonstrate the importance of designing conservation tools with human-wildlife interaction in mind and provide a framework for creating successful tools. These case studies include a range of complexities, from simple cat collars to machine learning and game theory methodologies. Our goal is to introduce and inform current and future researchers in the field of conservation technology and provide references for educating the next generation of conservation technologists. Conservation technology not only has the potential to benefit biodiversity but also has broader impacts on fields such as sustainability and environmental protection. By using innovative technologies to address conservation challenges, we can find more effective and efficient solutions to protect and preserve our planet's resources.  ( 2 min )
    Cluster-guided Contrastive Graph Clustering Network. (arXiv:2301.01098v1 [cs.LG])
    Benefiting from the intrinsic supervision information exploitation capability, contrastive learning has achieved promising performance in the field of deep graph clustering recently. However, we observe that two drawbacks of the positive and negative sample construction mechanisms limit the performance of existing algorithms from further improvement. 1) The quality of positive samples heavily depends on the carefully designed data augmentations, while inappropriate data augmentations would easily lead to the semantic drift and indiscriminative positive samples. 2) The constructed negative samples are not reliable for ignoring important clustering information. To solve these problems, we propose a Cluster-guided Contrastive deep Graph Clustering network (CCGC) by mining the intrinsic supervision information in the high-confidence clustering results. Specifically, instead of conducting complex node or edge perturbation, we construct two views of the graph by designing special Siamese encoders whose weights are not shared between the sibling sub-networks. Then, guided by the high-confidence clustering information, we carefully select and construct the positive samples from the same high-confidence cluster in two views. Moreover, to construct semantic meaningful negative sample pairs, we regard the centers of different high-confidence clusters as negative samples, thus improving the discriminative capability and reliability of the constructed sample pairs. Lastly, we design an objective function to pull close the samples from the same cluster while pushing away those from other clusters by maximizing and minimizing the cross-view cosine similarity between positive and negative samples. Extensive experimental results on six datasets demonstrate the effectiveness of CCGC compared with the existing state-of-the-art algorithms.  ( 2 min )
    MERLIN: Multi-agent offline and transfer learning for occupant-centric energy flexible operation of grid-interactive communities using smart meter data and CityLearn. (arXiv:2301.01148v1 [cs.LG])
    The decarbonization of buildings presents new challenges for the reliability of the electrical grid as a result of the intermittency of renewable energy sources and increase in grid load brought about by end-use electrification. To restore reliability, grid-interactive efficient buildings can provide flexibility services to the grid through demand response. Residential demand response programs are hindered by the need for manual intervention by customers. To maximize the energy flexibility potential of residential buildings, an advanced control architecture is needed. Reinforcement learning is well-suited for the control of flexible resources as it is able to adapt to unique building characteristics compared to expert systems. Yet, factors hindering the adoption of RL in real-world applications include its large data requirements for training, control security and generalizability. Here we address these challenges by proposing the MERLIN framework and using a digital twin of a real-world 17-building grid-interactive residential community in CityLearn. We show that 1) independent RL-controllers for batteries improve building and district level KPIs compared to a reference RBC by tailoring their policies to individual buildings, 2) despite unique occupant behaviours, transferring the RL policy of any one of the buildings to other buildings provides comparable performance while reducing the cost of training, 3) training RL-controllers on limited temporal data that does not capture full seasonality in occupant behaviour has little effect on performance. Although, the zero-net-energy (ZNE) condition of the buildings could be maintained or worsened as a result of controlled batteries, KPIs that are typically improved by ZNE condition (electricity price and carbon emissions) are further improved when the batteries are managed by an advanced controller.  ( 2 min )
    Invalidator: Automated Patch Correctness Assessment via Semantic and Syntactic Reasoning. (arXiv:2301.01113v1 [cs.SE])
    In this paper, we propose a novel technique, namely INVALIDATOR, to automatically assess the correctness of APR-generated patches via semantic and syntactic reasoning. INVALIDATOR reasons about program semantic via program invariants while it also captures program syntax via language semantic learned from large code corpus using the pre-trained language model. Given a buggy program and the developer-patched program, INVALIDATOR infers likely invariants on both programs. Then, INVALIDATOR determines that a APR-generated patch overfits if: (1) it violates correct specifications or (2) maintains errors behaviors of the original buggy program. In case our approach fails to determine an overfitting patch based on invariants, INVALIDATOR utilizes a trained model from labeled patches to assess patch correctness based on program syntax. The benefit of INVALIDATOR is three-fold. First, INVALIDATOR is able to leverage both semantic and syntactic reasoning to enhance its discriminant capability. Second, INVALIDATOR does not require new test cases to be generated but instead only relies on the current test suite and uses invariant inference to generalize the behaviors of a program. Third, INVALIDATOR is fully automated. We have conducted our experiments on a dataset of 885 patches generated on real-world programs in Defects4J. Experiment results show that INVALIDATOR correctly classified 79% overfitting patches, accounting for 23% more overfitting patches being detected by the best baseline. INVALIDATOR also substantially outperforms the best baselines by 14% and 19% in terms of Accuracy and F-Measure, respectively.  ( 2 min )
    Boosting Neural Networks to Decompile Optimized Binaries. (arXiv:2301.00969v1 [cs.LG])
    Decompilation aims to transform a low-level program language (LPL) (eg., binary file) into its functionally-equivalent high-level program language (HPL) (e.g., C/C++). It is a core technology in software security, especially in vulnerability discovery and malware analysis. In recent years, with the successful application of neural machine translation (NMT) models in natural language processing (NLP), researchers have tried to build neural decompilers by borrowing the idea of NMT. They formulate the decompilation process as a translation problem between LPL and HPL, aiming to reduce the human cost required to develop decompilation tools and improve their generalizability. However, state-of-the-art learning-based decompilers do not cope well with compiler-optimized binaries. Since real-world binaries are mostly compiler-optimized, decompilers that do not consider optimized binaries have limited practical significance. In this paper, we propose a novel learning-based approach named NeurDP, that targets compiler-optimized binaries. NeurDP uses a graph neural network (GNN) model to convert LPL to an intermediate representation (IR), which bridges the gap between source code and optimized binary. We also design an Optimized Translation Unit (OTU) to split functions into smaller code fragments for better translation performance. Evaluation results on datasets containing various types of statements show that NeurDP can decompile optimized binaries with 45.21% higher accuracy than state-of-the-art neural decompilation frameworks.  ( 2 min )
    Computing the Performance of A New Adaptive Sampling Algorithm Based on The Gittins Index in Experiments with Exponential Rewards. (arXiv:2301.01107v1 [stat.CO])
    Designing experiments often requires balancing between learning about the true treatment effects and earning from allocating more samples to the superior treatment. While optimal algorithms for the Multi-Armed Bandit Problem (MABP) provide allocation policies that optimally balance learning and earning, they tend to be computationally expensive. The Gittins Index (GI) is a solution to the MABP that can simultaneously attain optimality and computationally efficiency goals, and it has been recently used in experiments with Bernoulli and Gaussian rewards. For the first time, we present a modification of the GI rule that can be used in experiments with exponentially-distributed rewards. We report its performance in simulated 2- armed and 3-armed experiments. Compared to traditional non-adaptive designs, our novel GI modified design shows operating characteristics comparable in learning (e.g. statistical power) but substantially better in earning (e.g. direct benefits). This illustrates the potential that designs using a GI approach to allocate participants have to improve participant benefits, increase efficiencies, and reduce experimental costs in adaptive multi-armed experiments with exponential rewards.  ( 2 min )
    Benchmarking common uncertainty estimation methods with histopathological images under domain shift and label noise. (arXiv:2301.01054v1 [eess.IV])
    In the past years, deep learning has seen an increase of usage in the domain of histopathological applications. However, while these approaches have shown great potential, in high-risk environments deep learning models need to be able to judge their own uncertainty and be able to reject inputs when there is a significant chance of misclassification. In this work, we conduct a rigorous evaluation of the most commonly used uncertainty and robustness methods for the classification of Whole-Slide-Images under domain shift using the H\&E stained Camelyon17 breast cancer dataset. Although it is known that histopathological data can be subject to strong domain shift and label noise, to our knowledge this is the first work that compares the most common methods for uncertainty estimation under these aspects. In our experiments, we compare Stochastic Variational Inference, Monte-Carlo Dropout, Deep Ensembles, Test-Time Data Augmentation as well as combinations thereof. We observe that ensembles of methods generally lead to higher accuracies and better calibration and that Test-Time Data Augmentation can be a promising alternative when choosing an appropriate set of augmentations. Across methods, a rejection of the most uncertain tiles leads to a significant increase in classification accuracy on both in-distribution as well as out-of-distribution data. Furthermore, we conduct experiments comparing these methods under varying conditions of label noise. We observe that the border regions of the Camelyon17 dataset are subject to label noise and evaluate the robustness of the included methods against different noise levels. Lastly, we publish our code framework to facilitate further research on uncertainty estimation on histopathological data.  ( 2 min )
    RELIANT: Fair Knowledge Distillation for Graph Neural Networks. (arXiv:2301.01150v1 [cs.LG])
    Graph Neural Networks (GNNs) have shown satisfying performance on various graph learning tasks. To achieve better fitting capability, most GNNs are with a large number of parameters, which makes these GNNs computationally expensive. Therefore, it is difficult to deploy them onto edge devices with scarce computational resources, e.g., mobile phones and wearable smart devices. Knowledge Distillation (KD) is a common solution to compress GNNs, where a light-weighted model (i.e., the student model) is encouraged to mimic the behavior of a computationally expensive GNN (i.e., the teacher GNN model). Nevertheless, most existing GNN-based KD methods lack fairness consideration. As a consequence, the student model usually inherits and even exaggerates the bias from the teacher GNN. To handle such a problem, we take initial steps towards fair knowledge distillation for GNNs. Specifically, we first formulate a novel problem of fair knowledge distillation for GNN-based teacher-student frameworks. Then we propose a principled framework named RELIANT to mitigate the bias exhibited by the student model. Notably, the design of RELIANT is decoupled from any specific teacher and student model structures, and thus can be easily adapted to various GNN-based KD frameworks. We perform extensive experiments on multiple real-world datasets, which corroborates that RELIANT achieves less biased GNN knowledge distillation while maintaining high prediction utility.  ( 2 min )
    Vocabulary-informed Zero-shot and Open-set Learning. (arXiv:2301.00998v1 [cs.CV])
    Despite significant progress in object categorization, in recent years, a number of important challenges remain; mainly, the ability to learn from limited labeled data and to recognize object classes within large, potentially open, set of labels. Zero-shot learning is one way of addressing these challenges, but it has only been shown to work with limited sized class vocabularies and typically requires separation between supervised and unsupervised classes, allowing former to inform the latter but not vice versa. We propose the notion of vocabulary-informed learning to alleviate the above mentioned challenges and address problems of supervised, zero-shot, generalized zero-shot and open set recognition using a unified framework. Specifically, we propose a weighted maximum margin framework for semantic manifold-based recognition that incorporates distance constraints from (both supervised and unsupervised) vocabulary atoms. Distance constraints ensure that labeled samples are projected closer to their correct prototypes, in the embedding space, than to others. We illustrate that resulting model shows improvements in supervised, zero-shot, generalized zero-shot, and large open set recognition, with up to 310K class vocabulary on Animal with Attributes and ImageNet datasets.  ( 2 min )
    xDeepInt: a hybrid architecture for modeling the vector-wise and bit-wise feature interactions. (arXiv:2301.01089v1 [cs.LG])
    Learning feature interactions is the key to success for the large-scale CTR prediction and recommendation. In practice, handcrafted feature engineering usually requires exhaustive searching. In order to reduce the high cost of human efforts in feature engineering, researchers propose several deep neural networks (DNN)-based approaches to learn the feature interactions in an end-to-end fashion. However, existing methods either do not learn both vector-wise interactions and bit-wise interactions simultaneously, or fail to combine them in a controllable manner. In this paper, we propose a new model, xDeepInt, based on a novel network architecture called polynomial interaction network (PIN) which learns higher-order vector-wise interactions recursively. By integrating subspace-crossing mechanism, we enable xDeepInt to balance the mixture of vector-wise and bit-wise feature interactions at a bounded order. Based on the network architecture, we customize a combined optimization strategy to conduct feature selection and interaction selection. We implement the proposed model and evaluate the model performance on three real-world datasets. Our experiment results demonstrate the efficacy and effectiveness of xDeepInt over state-of-the-art models. We open-source the TensorFlow implementation of xDeepInt: https://github.com/yanyachen/xDeepInt.  ( 2 min )
    On the causality-preservation capabilities of generative modelling. (arXiv:2301.01109v1 [cs.LG])
    Modeling lies at the core of both the financial and the insurance industry for a wide variety of tasks. The rise and development of machine learning and deep learning models have created many opportunities to improve our modeling toolbox. Breakthroughs in these fields often come with the requirement of large amounts of data. Such large datasets are often not publicly available in finance and insurance, mainly due to privacy and ethics concerns. This lack of data is currently one of the main hurdles in developing better models. One possible option to alleviating this issue is generative modeling. Generative models are capable of simulating fake but realistic-looking data, also referred to as synthetic data, that can be shared more freely. Generative Adversarial Networks (GANs) is such a model that increases our capacity to fit very high-dimensional distributions of data. While research on GANs is an active topic in fields like computer vision, they have found limited adoption within the human sciences, like economics and insurance. Reason for this is that in these fields, most questions are inherently about identification of causal effects, while to this day neural networks, which are at the center of the GAN framework, focus mostly on high-dimensional correlations. In this paper we study the causal preservation capabilities of GANs and whether the produced synthetic data can reliably be used to answer causal questions. This is done by performing causal analyses on the synthetic data, produced by a GAN, with increasingly more lenient assumptions. We consider the cross-sectional case, the time series case and the case with a complete structural model. It is shown that in the simple cross-sectional scenario where correlation equals causation the GAN preserves causality, but that challenges arise for more advanced analyses.  ( 2 min )
    A Theory of I/O-Efficient Sparse Neural Network Inference. (arXiv:2301.01048v1 [cs.DC])
    As the accuracy of machine learning models increases at a fast rate, so does their demand for energy and compute resources. On a low level, the major part of these resources is consumed by data movement between different memory units. Modern hardware architectures contain a form of fast memory (e.g., cache, registers), which is small, and a slow memory (e.g., DRAM), which is larger but expensive to access. We can only process data that is stored in fast memory, which incurs data movement (input/output-operations, or I/Os) between the two units. In this paper, we provide a rigorous theoretical analysis of the I/Os needed in sparse feedforward neural network (FFNN) inference. We establish bounds that determine the optimal number of I/Os up to a factor of 2 and present a method that uses a number of I/Os within that range. Much of the I/O-complexity is determined by a few high-level properties of the FFNN (number of inputs, outputs, neurons, and connections), but if we want to get closer to the exact lower bound, the instance-specific sparsity patterns need to be considered. Departing from the 2-optimal computation strategy, we show how to reduce the number of I/Os further with simulated annealing. Complementing this result, we provide an algorithm that constructively generates networks with maximum I/O-efficiency for inference. We test the algorithms and empirically verify our theoretical and algorithmic contributions. In our experiments on real hardware we observe speedups of up to 45$\times$ relative to the standard way of performing inference.  ( 2 min )
    KoopmanLab: A PyTorch module of Koopman neural operator family for solving partial differential equations. (arXiv:2301.01104v1 [cs.LG])
    Given the increasingly intricate forms of partial differential equations (PDEs) in physics and related fields, computationally solving PDEs without analytic solutions inevitably suffers from the trade-off between accuracy and efficiency. Recent advances in neural operators, a kind of mesh-independent neural-network-based PDE solvers, have suggested the dawn of overcoming this challenge. In this emerging direction, Koopman neural operator (KNO) is a representative demonstration and outperforms other state-of-the-art alternatives in terms of accuracy and efficiency. Here we present KoopmanLab, a self-contained and user-friendly PyTorch module of the Koopman neural operator family for solving partial differential equations. Beyond the original version of KNO, we develop multiple new variants of KNO based on different neural network architectures to improve the general applicability of our module. These variants are validated by mesh-independent and long-term prediction experiments implemented on representative PDEs (e.g., the Navier-Stokes equation and the Bateman-Burgers equation) and ERA5 (i.e., one of the largest high-resolution data sets of global-scale climate fields). These demonstrations suggest the potential of KoopmanLab to be considered in diverse applications of partial differential equations.  ( 2 min )
    Uncertainty in Real-Time Semantic Segmentation on Embedded Systems. (arXiv:2301.01201v1 [cs.CV])
    Application for semantic segmentation models in areas such as autonomous vehicles and human computer interaction require real-time predictive capabilities. The challenges of addressing real-time application is amplified by the need to operate on resource constrained hardware. Whilst development of real-time methods for these platforms has increased, these models are unable to sufficiently reason about uncertainty present. This paper addresses this by combining deep feature extraction from pre-trained models with Bayesian regression and moment propagation for uncertainty aware predictions. We demonstrate how the proposed method can yield meaningful uncertainty on embedded hardware in real-time whilst maintaining predictive performance.  ( 2 min )
    Deep R Programming. (arXiv:2301.01188v1 [cs.PL])
    Deep R Programming is a comprehensive course on one of the most popular languages in data science (statistical computing, graphics, machine learning, data wrangling and analytics). It introduces the base language in-depth and is aimed at ambitious students, practitioners, and researchers who would like to become independent users of this powerful environment. This textbook is a non-profit project. Its online and PDF versions are freely available at . This early draft is distributed in the hope that it will be useful.  ( 2 min )
    Risk-Averse MDPs under Reward Ambiguity. (arXiv:2301.01045v1 [cs.LG])
    We propose a distributionally robust return-risk model for Markov decision processes (MDPs) under risk and reward ambiguity. The proposed model optimizes the weighted average of mean and percentile performances, and it covers the distributionally robust MDPs and the distributionally robust chance-constrained MDPs (both under reward ambiguity) as special cases. By considering that the unknown reward distribution lies in a Wasserstein ambiguity set, we derive the tractable reformulation for our model. In particular, we show that that the return-risk model can also account for risk from uncertain transition kernel when one only seeks deterministic policies, and that a distributionally robust MDP under the percentile criterion can be reformulated as its nominal counterpart at an adjusted risk level. A scalable first-order algorithm is designed to solve large-scale problems, and we demonstrate the advantages of our proposed model and algorithm through numerical experiments.  ( 2 min )
    Effective and Efficient Training for Sequential Recommendation Using Cumulative Cross-Entropy Loss. (arXiv:2301.00979v1 [cs.IR])
    Increasing research interests focus on sequential recommender systems, aiming to model dynamic sequence representation precisely. However, the most commonly used loss function in state-of-the-art sequential recommendation models has essential limitations. To name a few, Bayesian Personalized Ranking (BPR) loss suffers the vanishing gradient problem from numerous negative sampling and predictionbiases; Binary Cross-Entropy (BCE) loss subjects to negative sampling numbers, thereby it is likely to ignore valuable negative examples and reduce the training efficiency; Cross-Entropy (CE) loss only focuses on the last timestamp of the training sequence, which causes low utilization of sequence information and results in inferior user sequence representation. To avoid these limitations, in this paper, we propose to calculate Cumulative Cross-Entropy (CCE) loss over the sequence. CCE is simple and direct, which enjoys the virtues of painless deployment, no negative sampling, and effective and efficient training. We conduct extensive experiments on five benchmark datasets to demonstrate the effectiveness and efficiency of CCE. The results show that employing CCE loss on three state-of-the-art models GRU4Rec, SASRec, and S3-Rec can reach 125.63%, 69.90%, and 33.24% average improvement of full ranking NDCG@5, respectively. Using CCE, the performance curve of the models on the test data increases rapidly with the wall clock time, and is superior to that of other loss functions in almost the whole process of model training.  ( 2 min )
    Continual Treatment Effect Estimation: Challenges and Opportunities. (arXiv:2301.01026v1 [cs.LG])
    A further understanding of cause and effect within observational data is critical across many domains, such as economics, health care, public policy, web mining, online advertising, and marketing campaigns. Although significant advances have been made to overcome the challenges in causal effect estimation with observational data, such as missing counterfactual outcomes and selection bias between treatment and control groups, the existing methods mainly focus on source-specific and stationary observational data. Such learning strategies assume that all observational data are already available during the training phase and from only one source. This practical concern of accessibility is ubiquitous in various academic and industrial applications. That's what it boiled down to: in the era of big data, we face new challenges in causal inference with observational data, i.e., the extensibility for incrementally available observational data, the adaptability for extra domain adaptation problem except for the imbalance between treatment and control groups, and the accessibility for an enormous amount of data. In this position paper, we formally define the problem of continual treatment effect estimation, describe its research challenges, and then present possible solutions to this problem. Moreover, we will discuss future research directions on this topic.  ( 2 min )
    Dissecting Continual Learning a Structural and Data Analysis. (arXiv:2301.01033v1 [cs.CV])
    Continual Learning (CL) is a field dedicated to devise algorithms able to achieve lifelong learning. Overcoming the knowledge disruption of previously acquired concepts, a drawback affecting deep learning models and that goes by the name of catastrophic forgetting, is a hard challenge. Currently, deep learning methods can attain impressive results when the data modeled does not undergo a considerable distributional shift in subsequent learning sessions, but whenever we expose such systems to this incremental setting, performance drop very quickly. Overcoming this limitation is fundamental as it would allow us to build truly intelligent systems showing stability and plasticity. Secondly, it would allow us to overcome the onerous limitation of retraining these architectures from scratch with the new updated data. In this thesis, we tackle the problem from multiple directions. In a first study, we show that in rehearsal-based techniques (systems that use memory buffer), the quantity of data stored in the rehearsal buffer is a more important factor over the quality of the data. Secondly, we propose one of the early works of incremental learning on ViTs architectures, comparing functional, weight and attention regularization approaches and propose effective novel a novel asymmetric loss. At the end we conclude with a study on pretraining and how it affects the performance in Continual Learning, raising some questions about the effective progression of the field. We then conclude with some future directions and closing remarks.  ( 2 min )
    Through-life Monitoring of Resource-constrained Systems and Fleets. (arXiv:2301.01017v1 [cs.LG])
    A Digital Twin (DT) is a simulation of a physical system that provides information to make decisions that add economic, social or commercial value. The behaviour of a physical system changes over time, a DT must therefore be continually updated with data from the physical systems to reflect its changing behaviour. For resource-constrained systems, updating a DT is non-trivial because of challenges such as on-board learning and the off-board data transfer. This paper presents a framework for updating data-driven DTs of resource-constrained systems geared towards system health monitoring. The proposed solution consists of: (1) an on-board system running a light-weight DT allowing the prioritisation and parsimonious transfer of data generated by the physical system; and (2) off-board robust updating of the DT and detection of anomalous behaviours. Two case studies are considered using a production gas turbine engine system to demonstrate the digital representation accuracy for real-world, time-varying physical systems.  ( 2 min )
    A Theory of Human-Like Few-Shot Learning. (arXiv:2301.01047v1 [cs.LG])
    We aim to bridge the gap between our common-sense few-sample human learning and large-data machine learning. We derive a theory of human-like few-shot learning from von-Neuman-Landauer's principle. modelling human learning is difficult as how people learn varies from one to another. Under commonly accepted definitions, we prove that all human or animal few-shot learning, and major models including Free Energy Principle and Bayesian Program Learning that model such learning, approximate our theory, under Church-Turing thesis. We find that deep generative model like variational autoencoder (VAE) can be used to approximate our theory and perform significantly better than baseline models including deep neural networks, for image recognition, low resource language processing, and character recognition.  ( 2 min )
    Meta-learning generalizable dynamics from trajectories. (arXiv:2301.00957v1 [cs.LG])
    We present the interpretable meta neural ordinary differential equation (iMODE) method to rapidly learn generalizable (i.e., not parameter-specific) dynamics from trajectories of multiple dynamical systems that vary in their physical parameters. The iMODE method learns meta-knowledge, the functional variations of the force field of dynamical system instances without knowing the physical parameters, by adopting a bi-level optimization framework: an outer level capturing the common force field form among studied dynamical system instances and an inner level adapting to individual system instances. A priori physical knowledge can be conveniently embedded in the neural network architecture as inductive bias, such as conservative force field and Euclidean symmetry. With the learned meta-knowledge, iMODE can model an unseen system within seconds, and inversely reveal knowledge on the physical parameters of a system, or as a Neural Gauge to "measure" the physical parameters of an unseen system with observed trajectories. We test the validity of the iMODE method on bistable, double pendulum, Van der Pol, Slinky, and reaction-diffusion systems.  ( 2 min )
    e-Inu: Simulating A Quadruped Robot With Emotional Sentience. (arXiv:2301.00964v1 [cs.RO])
    Quadruped robots are currently used in industrial robotics as mechanical aid to automate several routine tasks. However, presently, the usage of such a robot in a domestic setting is still very much a part of the research. This paper discusses the understanding and virtual simulation of such a robot capable of detecting and understanding human emotions, generating its gait, and responding via sounds and expression on a screen. To this end, we use a combination of reinforcement learning and software engineering concepts to simulate a quadruped robot that can understand emotions, navigate through various terrains and detect sound sources, and respond to emotions using audio-visual feedback. This paper aims to establish the framework of simulating a quadruped robot that is emotionally intelligent and can primarily respond to audio-visual stimuli using motor or audio response. The emotion detection from the speech was not as performant as ERANNs or Zeta Policy learning, still managing an accuracy of 63.5%. The video emotion detection system produced results that are almost at par with the state of the art, with an accuracy of 99.66%. Due to its "on-policy" learning process, the PPO algorithm was extremely rapid to learn, allowing the simulated dog to demonstrate a remarkably seamless gait across the different cadences and variations. This enabled the quadruped robot to respond to generated stimuli, allowing us to conclude that it functions as predicted and satisfies the aim of this work.  ( 2 min )
    Data Valuation Without Training of a Model. (arXiv:2301.00930v1 [cs.LG])
    Many recent works on understanding deep learning try to quantify how much individual data instances influence the optimization and generalization of a model, either by analyzing the behavior of the model during training or by measuring the performance gap of the model when the instance is removed from the dataset. Such approaches reveal characteristics and importance of individual instances, which may provide useful information in diagnosing and improving deep learning. However, most of the existing works on data valuation require actual training of a model, which often demands high-computational cost. In this paper, we provide a training-free data valuation score, called complexity-gap score, which is a data-centric score to quantify the influence of individual instances in generalization of two-layer overparameterized neural networks. The proposed score can quantify irregularity of the instances and measure how much each data instance contributes in the total movement of the network parameters during training. We theoretically analyze and empirically demonstrate the effectiveness of the complexity-gap score in finding 'irregular or mislabeled' data instances, and also provide applications of the score in analyzing datasets and diagnosing training dynamics.  ( 2 min )
    Improving Performance in Neural Networks by Dendrites-Activated Connections. (arXiv:2301.00924v1 [cs.NE])
    Computational units in artificial neural networks follow a simplified model of biological neurons. In the biological model, the output signal of a neuron runs down the axon, splits following the many branches at its end, and passes identically to all the downward neurons of the network. Each of the downward neurons will use their copy of this signal as one of many inputs dendrites, integrate them all and fire an output, if above some threshold. In the artificial neural network, this translates to the fact that the nonlinear filtering of the signal is performed in the upward neuron, meaning that in practice the same activation is shared between all the downward neurons that use that signal as their input. Dendrites thus play a passive role. We propose a slightly more complex model for the biological neuron, where dendrites play an active role: the activation in the output of the upward neuron becomes optional, and instead the signals going through each dendrite undergo independent nonlinear filterings, before the linear combination. We implement this new model into a ReLU computational unit and discuss its biological plausibility. We compare this new computational unit with the standard one and describe it from a geometrical point of view. We provide a Keras implementation of this unit into fully connected and convolutional layers and estimate their FLOPs and weights change. We then use these layers in ResNet architectures on CIFAR-10, CIFAR-100, Imagenette, and Imagewoof, obtaining performance improvements over standard ResNets up to 1.73%. Finally, we prove a universal representation theorem for continuous functions on compact sets and show that this new unit has more representational power than its standard counterpart.  ( 2 min )
    Look, Listen, and Attack: Backdoor Attacks Against Video Action Recognition. (arXiv:2301.00986v1 [cs.CV])
    Deep neural networks (DNNs) are vulnerable to a class of attacks called "backdoor attacks", which create an association between a backdoor trigger and a target label the attacker is interested in exploiting. A backdoored DNN performs well on clean test images, yet persistently predicts an attacker-defined label for any sample in the presence of the backdoor trigger. Although backdoor attacks have been extensively studied in the image domain, there are very few works that explore such attacks in the video domain, and they tend to conclude that image backdoor attacks are less effective in the video domain. In this work, we revisit the traditional backdoor threat model and incorporate additional video-related aspects to that model. We show that poisoned-label image backdoor attacks could be extended temporally in two ways, statically and dynamically, leading to highly effective attacks in the video domain. In addition, we explore natural video backdoors to highlight the seriousness of this vulnerability in the video domain. And, for the first time, we study multi-modal (audiovisual) backdoor attacks against video action recognition models, where we show that attacking a single modality is enough for achieving a high attack success rate.  ( 2 min )
    Deep Spectral Q-learning with Application to Mobile Health. (arXiv:2301.00927v1 [stat.ML])
    Dynamic treatment regimes assign personalized treatments to patients sequentially over time based on their baseline information and time-varying covariates. In mobile health applications, these covariates are typically collected at different frequencies over a long time horizon. In this paper, we propose a deep spectral Q-learning algorithm, which integrates principal component analysis (PCA) with deep Q-learning to handle the mixed frequency data. In theory, we prove that the mean return under the estimated optimal policy converges to that under the optimal one and establish its rate of convergence. The usefulness of our proposal is further illustrated via simulations and an application to a diabetes dataset.  ( 2 min )
    Safe Reinforcement Learning for an Energy-Efficient Driver Assistance System. (arXiv:2301.00904v1 [cs.RO])
    Reinforcement learning (RL)-based driver assistance systems seek to improve fuel consumption via continual improvement of powertrain control actions considering experiential data from the field. However, the need to explore diverse experiences in order to learn optimal policies often limits the application of RL techniques in safety-critical systems like vehicle control. In this paper, an exponential control barrier function (ECBF) is derived and utilized to filter unsafe actions proposed by an RL-based driver assistance system. The RL agent freely explores and optimizes the performance objectives while unsafe actions are projected to the closest actions in the safe domain. The reward is structured so that driver's acceleration requests are met in a manner that boosts fuel economy and doesn't compromise comfort. The optimal gear and traction torque control actions that maximize the cumulative reward are computed via the Maximum a Posteriori Policy Optimization (MPO) algorithm configured for a hybrid action space. The proposed safe-RL scheme is trained and evaluated in car following scenarios where it is shown that it effectively avoids collision both during training and evaluation while delivering on the expected fuel economy improvements for the driver assistance system.  ( 2 min )
    Exploring Complex Dynamical Systems via Nonconvex Optimization. (arXiv:2301.00923v1 [cs.LG])
    Cataloging the complex behaviors of dynamical systems can be challenging, even when they are well-described by a simple mechanistic model. If such a system is of limited analytical tractability, brute force simulation is often the only resort. We present an alternative, optimization-driven approach using tools from machine learning. We apply this approach to a novel, fully-optimizable, reaction-diffusion model which incorporates complex chemical reaction networks (termed "Dense Reaction-Diffusion Network" or "Dense RDN"). This allows us to systematically identify new states and behaviors, including pattern formation, dissipation-maximizing nonequilibrium states, and replication-like dynamical structures.  ( 2 min )
    Ranking Differential Privacy. (arXiv:2301.00841v1 [stat.ML])
    Rankings are widely collected in various real-life scenarios, leading to the leakage of personal information such as users' preferences on videos or news. To protect rankings, existing works mainly develop privacy protection on a single ranking within a set of ranking or pairwise comparisons of a ranking under the $\epsilon$-differential privacy. This paper proposes a novel notion called $\epsilon$-ranking differential privacy for protecting ranks. We establish the connection between the Mallows model (Mallows, 1957) and the proposed $\epsilon$-ranking differential privacy. This allows us to develop a multistage ranking algorithm to generate synthetic rankings while satisfying the developed $\epsilon$-ranking differential privacy. Theoretical results regarding the utility of synthetic rankings in the downstream tasks, including the inference attack and the personalized ranking tasks, are established. For the inference attack, we quantify how $\epsilon$ affects the estimation of the true ranking based on synthetic rankings. For the personalized ranking task, we consider varying privacy preferences among users and quantify how their privacy preferences affect the consistency in estimating the optimal ranking function. Extensive numerical experiments are carried out to verify the theoretical results and demonstrate the effectiveness of the proposed synthetic ranking algorithm.  ( 2 min )
    Faster Approximate Dynamic Programming by Freezing Slow States. (arXiv:2301.00922v1 [cs.AI])
    We consider infinite horizon Markov decision processes (MDPs) with fast-slow structure, meaning that certain parts of the state space move "fast" (and in a sense, are more influential) while other parts transition more "slowly." Such structure is common in real-world problems where sequential decisions need to be made at high frequencies, yet information that varies at a slower timescale also influences the optimal policy. Examples include: (1) service allocation for a multi-class queue with (slowly varying) stochastic costs, (2) a restless multi-armed bandit with an environmental state, and (3) energy demand response, where both day-ahead and real-time prices play a role in the firm's revenue. Models that fully capture these problems often result in MDPs with large state spaces and large effective time horizons (due to frequent decisions), rendering them computationally intractable. We propose an approximate dynamic programming algorithmic framework based on the idea of "freezing" the slow states, solving a set of simpler finite-horizon MDPs (the lower-level MDPs), and applying value iteration (VI) to an auxiliary MDP that transitions on a slower timescale (the upper-level MDP). We also extend the technique to a function approximation setting, where a feature-based linear architecture is used. On the theoretical side, we analyze the regret incurred by each variant of our frozen-state approach. Finally, we give empirical evidence that the frozen-state approach generates effective policies using just a fraction of the computational cost, while illustrating that simply omitting slow states from the decision modeling is often not a viable heuristic.  ( 2 min )
    Deep reinforcement learning for irrigation scheduling using high-dimensional sensor feedback. (arXiv:2301.00899v1 [cs.LG])
    Deep reinforcement learning has considerable potential to improve irrigation scheduling in many cropping systems by applying adaptive amounts of water based on various measurements over time. The goal is to discover an intelligent decision rule that processes information available to growers and prescribes sensible irrigation amounts for the time steps considered. Due to the technical novelty, however, the research on the technique remains sparse and impractical. To accelerate the progress, the paper proposes a general framework and actionable procedure that allow researchers to formulate their own optimisation problems and implement solution algorithms based on deep reinforcement learning. The effectiveness of the framework was demonstrated using a case study of irrigated wheat grown in a productive region of Australia where profits were maximised. Specifically, the decision rule takes nine state variable inputs: crop phenological stage, leaf area index, extractable soil water for each of the five top layers, cumulative rainfall and cumulative irrigation. It returns a probabilistic prescription over five candidate irrigation amounts (0, 10, 20, 30 and 40 mm) every day. The production system was simulated at Goondiwindi using the APSIM-Wheat crop model. After training in the learning environment using 1981--2010 weather data, the learned decision rule was tested individually for each year of 2011--2020. The results were compared against the benchmark profits obtained using irrigation schedules optimised individually for each of the considered years. The discovered decision rule prescribed daily irrigation amounts that achieved more than 96% of the benchmark profits. The framework is general and applicable to a wide range of cropping systems with realistic optimisation problems.  ( 2 min )
    Deep Learning and Computational Physics (Lecture Notes). (arXiv:2301.00942v1 [cs.LG])
    These notes were compiled as lecture notes for a course developed and taught at the University of the Southern California. They should be accessible to a typical engineering graduate student with a strong background in Applied Mathematics. The main objective of these notes is to introduce a student who is familiar with concepts in linear algebra and partial differential equations to select topics in deep learning. These lecture notes exploit the strong connections between deep learning algorithms and the more conventional techniques of computational physics to achieve two goals. First, they use concepts from computational physics to develop an understanding of deep learning algorithms. Not surprisingly, many concepts in deep learning can be connected to similar concepts in computational physics, and one can utilize this connection to better understand these algorithms. Second, several novel deep learning algorithms can be used to solve challenging problems in computational physics. Thus, they offer someone who is interested in modeling a physical phenomena with a complementary set of tools.  ( 2 min )
    One-shot domain adaptation in video-based assessment of surgical skills. (arXiv:2301.00812v1 [cs.CV])
    Deep Learning (DL) has achieved automatic and objective assessment of surgical skills. However, DL models are data-hungry and restricted to their training domain. This prevents them from transitioning to new tasks where data is limited. Hence, domain adaptation is crucial to implement DL in real life. Here, we propose a meta-learning model, A-VBANet, that can deliver domain-agnostic surgical skill classification via one-shot learning. We develop the A-VBANet on five laparoscopic and robotic surgical simulators. Additionally, we test it on operating room (OR) videos of laparoscopic cholecystectomy. Our model successfully adapts with accuracies up to 99.5% in one-shot and 99.9% in few-shot settings for simulated tasks and 89.7% for laparoscopic cholecystectomy. For the first time, we provide a domain-agnostic procedure for video-based assessment of surgical skills. A significant implication of this approach is that it allows the use of data from surgical simulators to assess performance in the operating room.  ( 2 min )
    OF-AE: Oblique Forest AutoEncoders. (arXiv:2301.00880v1 [cs.LG])
    In the present work we propose an unsupervised ensemble method consisting of oblique trees that can address the task of auto-encoding, namely Oblique Forest AutoEncoders (briefly OF-AE). Our method is a natural extension of the eForest encoder introduced in [1]. More precisely, by employing oblique splits consisting in multivariate linear combination of features instead of the axis-parallel ones, we will devise an auto-encoder method through the computation of a sparse solution of a set of linear inequalities consisting of feature values constraints. The code for reproducing our results is available at https://github.com/CDAlecsa/Oblique-Forest-AutoEncoders.  ( 2 min )
    Efficient Robustness Assessment via Adversarial Spatial-Temporal Focus on Videos. (arXiv:2301.00896v1 [cs.CV])
    Adversarial robustness assessment for video recognition models has raised concerns owing to their wide applications on safety-critical tasks. Compared with images, videos have much high dimension, which brings huge computational costs when generating adversarial videos. This is especially serious for the query-based black-box attacks where gradient estimation for the threat models is usually utilized, and high dimensions will lead to a large number of queries. To mitigate this issue, we propose to simultaneously eliminate the temporal and spatial redundancy within the video to achieve an effective and efficient gradient estimation on the reduced searching space, and thus query number could decrease. To implement this idea, we design the novel Adversarial spatial-temporal Focus (AstFocus) attack on videos, which performs attacks on the simultaneously focused key frames and key regions from the inter-frames and intra-frames in the video. AstFocus attack is based on the cooperative Multi-Agent Reinforcement Learning (MARL) framework. One agent is responsible for selecting key frames, and another agent is responsible for selecting key regions. These two agents are jointly trained by the common rewards received from the black-box threat models to perform a cooperative prediction. By continuously querying, the reduced searching space composed of key frames and key regions is becoming precise, and the whole query number becomes less than that on the original video. Extensive experiments on four mainstream video recognition models and three widely used action recognition datasets demonstrate that the proposed AstFocus attack outperforms the SOTA methods, which is prevenient in fooling rate, query number, time, and perturbation magnitude at the same.  ( 2 min )
    Multidimensional Item Response Theory in the Style of Collaborative Filtering. (arXiv:2301.00909v1 [stat.ML])
    This paper presents a machine learning approach to multidimensional item response theory (MIRT), a class of latent factor models that can be used to model and predict student performance from observed assessment data. Inspired by collaborative filtering, we define a general class of models that includes many MIRT models. We discuss the use of penalized joint maximum likelihood (JML) to estimate individual models and cross-validation to select the best performing model. This model evaluation process can be optimized using batching techniques, such that even sparse large-scale data can be analyzed efficiently. We illustrate our approach with simulated and real data, including an example from a massive open online course (MOOC). The high-dimensional model fit to this large and sparse dataset does not lend itself well to traditional methods of factor interpretation. By analogy to recommender-system applications, we propose an alternative "validation" of the factor model, using auxiliary information about the popularity of items consulted during an open-book exam in the course.  ( 2 min )
    A Concurrent CNN-RNN Approach for Multi-Step Wind Power Forecasting. (arXiv:2301.00819v1 [cs.LG])
    Wind power forecasting helps with the planning for the power systems by contributing to having a higher level of certainty in decision-making. Due to the randomness inherent to meteorological events (e.g., wind speeds), making highly accurate long-term predictions for wind power can be extremely difficult. One approach to remedy this challenge is to utilize weather information from multiple points across a geographical grid to obtain a holistic view of the wind patterns, along with temporal information from the previous power outputs of the wind farms. Our proposed CNN-RNN architecture combines convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to extract spatial and temporal information from multi-dimensional input data to make day-ahead predictions. In this regard, our method incorporates an ultra-wide learning view, combining data from multiple numerical weather prediction models, wind farms, and geographical locations. Additionally, we experiment with global forecasting approaches to understand the impact of training the same model over the datasets obtained from multiple different wind farms, and we employ a method where spatial information extracted from convolutional layers is passed to a tree ensemble (e.g., Light Gradient Boosting Machine (LGBM)) instead of fully connected layers. The results show that our proposed CNN-RNN architecture outperforms other models such as LGBM, Extra Tree regressor and linear regression when trained globally, but fails to replicate such performance when trained individually on each farm. We also observe that passing the spatial information from CNN to LGBM improves its performance, providing further evidence of CNN's spatial feature extraction capabilities.  ( 2 min )
    Tweet's popularity dynamics. (arXiv:2301.00853v1 [cs.LG])
    This article charts the work of a 4 month project aimed at automatically identifying patterns of tweets popularity evolution using Machine Learning and Deep Learning techniques. To apprehend both the data and the extent of the problem, a straightforward clustering algorithm based on a point to point distance is used. Then, in an attempt to refine the algorithm, various analyses especially using feature extraction techniques are conducted. Although the algorithm eventually fails to automate such a task, this exercise raises a complex but necessary issue touching on the impact of virality on social networks.  ( 2 min )
    Distributed Machine Learning for UAV Swarms: Computing, Sensing, and Semantics. (arXiv:2301.00912v1 [cs.LG])
    Unmanned aerial vehicle (UAV) swarms are considered as a promising technique for next-generation communication networks due to their flexibility, mobility, low cost, and the ability to collaboratively and autonomously provide services. Distributed learning (DL) enables UAV swarms to intelligently provide communication services, multi-directional remote surveillance, and target tracking. In this survey, we first introduce several popular DL algorithms such as federated learning (FL), multi-agent Reinforcement Learning (MARL), distributed inference, and split learning, and present a comprehensive overview of their applications for UAV swarms, such as trajectory design, power control, wireless resource allocation, user assignment, perception, and satellite communications. Then, we present several state-of-the-art applications of UAV swarms in wireless communication systems, such us reconfigurable intelligent surface (RIS), virtual reality (VR), semantic communications, and discuss the problems and challenges that DL-enabled UAV swarms can solve in these applications. Finally, we describe open problems of using DL in UAV swarms and future research directions of DL enabled UAV swarms. In summary, this survey provides a comprehensive survey of various DL applications for UAV swarms in extensive scenarios.  ( 2 min )
    NeuroExplainer: Fine-Grained Attention Decoding to Uncover Cortical Development Patterns of Preterm Infants. (arXiv:2301.00815v1 [cs.LG])
    Deploying reliable deep learning techniques in interdisciplinary applications needs learned models to output accurate and ({even more importantly}) explainable predictions. Existing approaches typically explicate network outputs in a post-hoc fashion, under an implicit assumption that faithful explanations come from accurate predictions/classifications. We have an opposite claim that explanations boost (or even determine) classification. That is, end-to-end learning of explanation factors to augment discriminative representation extraction could be a more intuitive strategy to inversely assure fine-grained explainability, e.g., in those neuroimaging and neuroscience studies with high-dimensional data containing noisy, redundant, and task-irrelevant information. In this paper, we propose such an explainable geometric deep network dubbed as NeuroExplainer, with applications to uncover altered infant cortical development patterns associated with preterm birth. Given fundamental cortical attributes as network input, our NeuroExplainer adopts a hierarchical attention-decoding framework to learn fine-grained attentions and respective discriminative representations to accurately recognize preterm infants from term-born infants at term-equivalent age. NeuroExplainer learns the hierarchical attention-decoding modules under subject-level weak supervision coupled with targeted regularizers deduced from domain knowledge regarding brain development. These prior-guided constraints implicitly maximizes the explainability metrics (i.e., fidelity, sparsity, and stability) in network training, driving the learned network to output detailed explanations and accurate classifications. Experimental results on the public dHCP benchmark suggest that NeuroExplainer led to quantitatively reliable explanation results that are qualitatively consistent with representative neuroimaging studies.  ( 2 min )
    Robust Average-Reward Markov Decision Processes. (arXiv:2301.00858v1 [cs.LG])
    In robust Markov decision processes (MDPs), the uncertainty in the transition kernel is addressed by finding a policy that optimizes the worst-case performance over an uncertainty set of MDPs. While much of the literature has focused on discounted MDPs, robust average-reward MDPs remain largely unexplored. In this paper, we focus on robust average-reward MDPs, where the goal is to find a policy that optimizes the worst-case average reward over an uncertainty set. We first take an approach that approximates average-reward MDPs using discounted MDPs. We prove that the robust discounted value function converges to the robust average-reward as the discount factor $\gamma$ goes to $1$, and moreover, when $\gamma$ is large, any optimal policy of the robust discounted MDP is also an optimal policy of the robust average-reward. We further design a robust dynamic programming approach, and theoretically characterize its convergence to the optimum. Then, we investigate robust average-reward MDPs directly without using discounted MDPs as an intermediate step. We derive the robust Bellman equation for robust average-reward MDPs, prove that the optimal policy can be derived from its solution, and further design a robust relative value iteration algorithm that provably finds its solution, or equivalently, the optimal robust policy.  ( 2 min )
    A Survey on Protein Representation Learning: Retrospect and Prospect. (arXiv:2301.00813v1 [cs.LG])
    Proteins are fundamental biological entities that play a key role in life activities. The amino acid sequences of proteins can be folded into stable 3D structures in the real physicochemical world, forming a special kind of sequence-structure data. With the development of Artificial Intelligence (AI) techniques, Protein Representation Learning (PRL) has recently emerged as a promising research topic for extracting informative knowledge from massive protein sequences or structures. To pave the way for AI researchers with little bioinformatics background, we present a timely and comprehensive review of PRL formulations and existing PRL methods from the perspective of model architectures, pretext tasks, and downstream applications. We first briefly introduce the motivations for protein representation learning and formulate it in a general and unified framework. Next, we divide existing PRL methods into three main categories: sequence-based, structure-based, and sequence-structure co-modeling. Finally, we discuss some technical challenges and potential directions for improving protein representation learning. The latest advances in PRL methods are summarized in a GitHub repository https://github.com/LirongWu/awesome-protein-representation-learning.  ( 2 min )
  • Open

    Adaptive Sampling for Discovery. (arXiv:2205.14829v3 [stat.ML] UPDATED)
    In this paper, we study a sequential decision-making problem, called Adaptive Sampling for Discovery (ASD). Starting with a large unlabeled dataset, algorithms for ASD adaptively label the points with the goal to maximize the sum of responses. This problem has wide applications to real-world discovery problems, for example drug discovery with the help of machine learning models. ASD algorithms face the well-known exploration-exploitation dilemma. The algorithm needs to choose points that yield information to improve model estimates but it also needs to exploit the model. We rigorously formulate the problem and propose a general information-directed sampling (IDS) algorithm. We provide theoretical guarantees for the performance of IDS in linear, graph and low-rank models. The benefits of IDS are shown in both simulation experiments and real-data experiments for discovering chemical reaction conditions.  ( 2 min )
    Offline Reinforcement Learning with Differential Privacy. (arXiv:2206.00810v2 [cs.LG] UPDATED)
    The offline reinforcement learning (RL) problem is often motivated by the need to learn data-driven decision policies in financial, legal and healthcare applications. However, the learned policy could retain sensitive information of individuals in the training data (e.g., treatment and outcome of patients), thus susceptible to various privacy risks. We design offline RL algorithms with differential privacy guarantees which provably prevent such risks. These algorithms also enjoy strong instance-dependent learning bounds under both tabular and linear Markov decision process (MDP) settings. Our theory and simulation suggest that the privacy guarantee comes at (almost) no drop in utility comparing to the non-private counterpart for a medium-size dataset.  ( 2 min )
    A Worker-Task Specialization Model for Crowdsourcing: Efficient Inference and Fundamental Limits. (arXiv:2111.12550v2 [cs.HC] UPDATED)
    Crowdsourcing system has emerged as an effective platform for labeling data with relatively low cost by using non-expert workers. Inferring correct labels from multiple noisy answers on data, however, has been a challenging problem, since the quality of the answers varies widely across tasks and workers. Many existing works have assumed that there is a fixed ordering of workers in terms of their skill levels, and focused on estimating worker skills to aggregate the answers from workers with different weights. In practice, however, the worker skill changes widely across tasks, especially when the tasks are heterogeneous. In this paper, we consider a new model, called $d$-type specialization model, in which each task and worker has its own (unknown) type and the reliability of each worker can vary in the type of a given task and that of a worker. We allow that the number $d$ of types can scale in the number of tasks. In this model, we characterize the optimal sample complexity to correctly infer the labels within any given accuracy, and propose label inference algorithms achieving the order-wise optimal limit even when the types of tasks or those of workers are unknown. We conduct experiments both on synthetic and real datasets, and show that our algorithm outperforms the existing algorithms developed based on more strict model assumptions.  ( 2 min )
    Fast and Accurate Graph Learning for Huge Data via Minipatch Ensembles. (arXiv:2110.12067v2 [stat.ML] UPDATED)
    Gaussian graphical models provide a powerful framework for uncovering conditional dependence relationships between sets of nodes; they have found applications in a wide variety of fields including sensor and communication networks, physics, finance, and computational biology. Often, one observes data on the nodes and the task is to learn the graph structure, or perform graphical model selection. While this is a well-studied problem with many popular techniques, there are typically three major practical challenges: i) many existing algorithms become computationally intractable in huge-data settings with tens of thousands of nodes; ii) the need for separate data-driven hyperparameter tuning considerably adds to the computational burden; iii) the statistical accuracy of selected edges often deteriorates as the dimension and/or the complexity of the underlying graph structures increase. We tackle these problems by developing the novel Minipatch Graph (MPGraph) estimator. Our approach breaks up the huge graph learning problem into many smaller problems by creating an ensemble of tiny random subsets of both the observations and the nodes, termed minipatches. We then leverage recent advances that use hard thresholding to solve the latent variable graphical model problem to consistently learn the graph on each minipatch. Our approach is computationally fast, embarrassingly parallelizable, memory efficient, and has integrated stability-based hyperparamter tuning. Additionally, we prove that under weaker assumptions than that of the Graphical Lasso, our MPGraph estimator achieves graph selection consistency. We compare our approach to state-of-the-art computational approaches for Gaussian graphical model selection including the BigQUIC algorithm, and empirically demonstrate that our approach is not only more statistically accurate but also extensively faster for huge graph learning problems.  ( 3 min )
    On the Sample Complexity and Metastability of Heavy-tailed Policy Search in Continuous Control. (arXiv:2106.08414v2 [cs.LG] UPDATED)
    Reinforcement learning is a framework for interactive decision-making with incentives sequentially revealed across time without a system dynamics model. Due to its scaling to continuous spaces, we focus on policy search where one iteratively improves a parameterized policy with stochastic policy gradient (PG) updates. In tabular Markov Decision Problems (MDPs), under persistent exploration and suitable parameterization, global optimality may be obtained. By contrast, in continuous space, the non-convexity poses a pathological challenge as evidenced by existing convergence results being mostly limited to stationarity or arbitrary local extrema. To close this gap, we step towards persistent exploration in continuous space through policy parameterizations defined by distributions of heavier tails defined by tail-index parameter alpha, which increases the likelihood of jumping in state space. Doing so invalidates smoothness conditions of the score function common to PG. Thus, we establish how the convergence rate to stationarity depends on the policy's tail index alpha, a Holder continuity parameter, integrability conditions, and an exploration tolerance parameter introduced here for the first time. Further, we characterize the dependence of the set of local maxima on the tail index through an exit and transition time analysis of a suitably defined Markov chain, identifying that policies associated with Levy Processes of a heavier tail converge to wider peaks. This phenomenon yields improved stability to perturbations in supervised learning, which we corroborate also manifests in improved performance of policy search, especially when myopic and farsighted incentives are misaligned.  ( 2 min )
    xDeepInt: a hybrid architecture for modeling the vector-wise and bit-wise feature interactions. (arXiv:2301.01089v1 [cs.LG])
    Learning feature interactions is the key to success for the large-scale CTR prediction and recommendation. In practice, handcrafted feature engineering usually requires exhaustive searching. In order to reduce the high cost of human efforts in feature engineering, researchers propose several deep neural networks (DNN)-based approaches to learn the feature interactions in an end-to-end fashion. However, existing methods either do not learn both vector-wise interactions and bit-wise interactions simultaneously, or fail to combine them in a controllable manner. In this paper, we propose a new model, xDeepInt, based on a novel network architecture called polynomial interaction network (PIN) which learns higher-order vector-wise interactions recursively. By integrating subspace-crossing mechanism, we enable xDeepInt to balance the mixture of vector-wise and bit-wise feature interactions at a bounded order. Based on the network architecture, we customize a combined optimization strategy to conduct feature selection and interaction selection. We implement the proposed model and evaluate the model performance on three real-world datasets. Our experiment results demonstrate the efficacy and effectiveness of xDeepInt over state-of-the-art models. We open-source the TensorFlow implementation of xDeepInt: https://github.com/yanyachen/xDeepInt.  ( 2 min )
    Dimension-agnostic inference using cross U-statistics. (arXiv:2011.05068v5 [math.ST] UPDATED)
    Classical asymptotic theory for statistical inference usually involves calibrating a statistic by fixing the dimension $d$ while letting the sample size $n$ increase to infinity. Recently, much effort has been dedicated towards understanding how these methods behave in high-dimensional settings, where $d$ and $n$ both increase to infinity together. This often leads to different inference procedures, depending on the assumptions about the dimensionality, leaving the practitioner in a bind: given a dataset with 100 samples in 20 dimensions, should they calibrate by assuming $n \gg d$, or $d/n \approx 0.2$? This paper considers the goal of dimension-agnostic inference; developing methods whose validity does not depend on any assumption on $d$ versus $n$. We introduce an approach that uses variational representations of existing test statistics along with sample splitting and self-normalization to produce a new test statistic with a Gaussian limiting distribution, regardless of how $d$ scales with $n$. The resulting statistic can be viewed as a careful modification of degenerate U-statistics, dropping diagonal blocks and retaining off-diagonal blocks. We exemplify our technique for some classical problems including one-sample mean and covariance testing, and show that our tests have minimax rate-optimal power against appropriate local alternatives. In most settings, our cross U-statistic matches the high-dimensional power of the corresponding (degenerate) U-statistic up to a $\sqrt{2}$ factor.  ( 2 min )
    Linear chain conditional random fields, hidden Markov models, and related classifiers. (arXiv:2301.01293v1 [stat.ML])
    Practitioners use Hidden Markov Models (HMMs) in different problems for about sixty years. Besides, Conditional Random Fields (CRFs) are an alternative to HMMs and appear in the literature as different and somewhat concurrent models. We propose two contributions. First, we show that basic Linear-Chain CRFs (LC-CRFs), considered as different from the HMMs, are in fact equivalent to them in the sense that for each LC-CRF there exists a HMM - that we specify - whom posterior distribution is identical to the given LC-CRF. Second, we show that it is possible to reformulate the generative Bayesian classifiers Maximum Posterior Mode (MPM) and Maximum a Posteriori (MAP) used in HMMs, as discriminative ones. The last point is of importance in many fields, especially in Natural Language Processing (NLP), as it shows that in some situations dropping HMMs in favor of CRFs was not necessary.  ( 2 min )
    A Tutorial on Parametric Variational Inference. (arXiv:2301.01236v1 [stat.ML])
    Variational inference uses optimization, rather than integration, to approximate the marginal likelihood, and thereby the posterior, in a Bayesian model. Thanks to advances in computational scalability made in the last decade, variational inference is now the preferred choice for many high-dimensional models and large datasets. This tutorial introduces variational inference from the parametric perspective that dominates these recent developments, in contrast to the mean-field perspective commonly found in other introductory texts.  ( 2 min )
    Deep Spectral Q-learning with Application to Mobile Health. (arXiv:2301.00927v1 [stat.ML])
    Dynamic treatment regimes assign personalized treatments to patients sequentially over time based on their baseline information and time-varying covariates. In mobile health applications, these covariates are typically collected at different frequencies over a long time horizon. In this paper, we propose a deep spectral Q-learning algorithm, which integrates principal component analysis (PCA) with deep Q-learning to handle the mixed frequency data. In theory, we prove that the mean return under the estimated optimal policy converges to that under the optimal one and establish its rate of convergence. The usefulness of our proposal is further illustrated via simulations and an application to a diabetes dataset.  ( 2 min )
    Optimal transport with $f$-divergence regularization and generalized Sinkhorn algorithm. (arXiv:2105.14337v2 [math.OC] UPDATED)
    Entropic regularization provides a generalization of the original optimal transport problem. It introduces a penalty term defined by the Kullback-Leibler divergence, making the problem more tractable via the celebrated Sinkhorn algorithm. Replacing the Kullback-Leibler divergence with a general $f$-divergence leads to a natural generalization. The case of divergences defined by superlinear functions was recently studied by Di Marino and Gerolin. Using convex analysis, we extend the theory developed so far to include all $f$-divergences defined by functions of Legendre type, and prove that under some mild conditions, strong duality holds, optimums in both the primal and dual problems are attained, the generalization of the $c$-transform is well-defined, and we give sufficient conditions for the generalized Sinkhorn algorithm to converge to an optimal solution. We propose a practical algorithm for computing an approximate solution of the optimal transport problem with $f$-divergence regularization via the generalized Sinkhorn algorithm. Finally, we present experimental results on synthetic 2-dimensional data, demonstrating the effects of using different $f$-divergences for regularization, which influences convergence speed, numerical stability and sparsity of the optimal coupling.  ( 2 min )
    Continual Treatment Effect Estimation: Challenges and Opportunities. (arXiv:2301.01026v1 [cs.LG])
    A further understanding of cause and effect within observational data is critical across many domains, such as economics, health care, public policy, web mining, online advertising, and marketing campaigns. Although significant advances have been made to overcome the challenges in causal effect estimation with observational data, such as missing counterfactual outcomes and selection bias between treatment and control groups, the existing methods mainly focus on source-specific and stationary observational data. Such learning strategies assume that all observational data are already available during the training phase and from only one source. This practical concern of accessibility is ubiquitous in various academic and industrial applications. That's what it boiled down to: in the era of big data, we face new challenges in causal inference with observational data, i.e., the extensibility for incrementally available observational data, the adaptability for extra domain adaptation problem except for the imbalance between treatment and control groups, and the accessibility for an enormous amount of data. In this position paper, we formally define the problem of continual treatment effect estimation, describe its research challenges, and then present possible solutions to this problem. Moreover, we will discuss future research directions on this topic.  ( 2 min )
    Multidimensional Item Response Theory in the Style of Collaborative Filtering. (arXiv:2301.00909v1 [stat.ML])
    This paper presents a machine learning approach to multidimensional item response theory (MIRT), a class of latent factor models that can be used to model and predict student performance from observed assessment data. Inspired by collaborative filtering, we define a general class of models that includes many MIRT models. We discuss the use of penalized joint maximum likelihood (JML) to estimate individual models and cross-validation to select the best performing model. This model evaluation process can be optimized using batching techniques, such that even sparse large-scale data can be analyzed efficiently. We illustrate our approach with simulated and real data, including an example from a massive open online course (MOOC). The high-dimensional model fit to this large and sparse dataset does not lend itself well to traditional methods of factor interpretation. By analogy to recommender-system applications, we propose an alternative "validation" of the factor model, using auxiliary information about the popularity of items consulted during an open-book exam in the course.  ( 2 min )
    State and parameter learning with PaRIS particle Gibbs. (arXiv:2301.00900v1 [stat.ME])
    Non-linear state-space models, also known as general hidden Markov models, are ubiquitous in statistical machine learning, being the most classical generative models for serial data and sequences in general. The particle-based, rapid incremental smoother PaRIS is a sequential Monte Carlo (SMC) technique allowing for efficient online approximation of expectations of additive functionals under the smoothing distribution in these models. Such expectations appear naturally in several learning contexts, such as likelihood estimation (MLE) and Markov score climbing (MSC). PARIS has linear computational complexity, limited memory requirements and comes with non-asymptotic bounds, convergence results and stability guarantees. Still, being based on self-normalised importance sampling, the PaRIS estimator is biased. Our first contribution is to design a novel additive smoothing algorithm, the Parisian particle Gibbs PPG sampler, which can be viewed as a PaRIS algorithm driven by conditional SMC moves, resulting in bias-reduced estimates of the targeted quantities. We substantiate the PPG algorithm with theoretical results, including new bounds on bias and variance as well as deviation inequalities. Our second contribution is to apply PPG in a learning framework, covering MLE and MSC as special examples. In this context, we establish, under standard assumptions, non-asymptotic bounds highlighting the value of bias reduction and the implicit Rao--Blackwellization of PPG. These are the first non-asymptotic results of this kind in this setting. We illustrate our theoretical results with numerical experiments supporting our claims.  ( 2 min )
    Ranking Differential Privacy. (arXiv:2301.00841v1 [stat.ML])
    Rankings are widely collected in various real-life scenarios, leading to the leakage of personal information such as users' preferences on videos or news. To protect rankings, existing works mainly develop privacy protection on a single ranking within a set of ranking or pairwise comparisons of a ranking under the $\epsilon$-differential privacy. This paper proposes a novel notion called $\epsilon$-ranking differential privacy for protecting ranks. We establish the connection between the Mallows model (Mallows, 1957) and the proposed $\epsilon$-ranking differential privacy. This allows us to develop a multistage ranking algorithm to generate synthetic rankings while satisfying the developed $\epsilon$-ranking differential privacy. Theoretical results regarding the utility of synthetic rankings in the downstream tasks, including the inference attack and the personalized ranking tasks, are established. For the inference attack, we quantify how $\epsilon$ affects the estimation of the true ranking based on synthetic rankings. For the personalized ranking task, we consider varying privacy preferences among users and quantify how their privacy preferences affect the consistency in estimating the optimal ranking function. Extensive numerical experiments are carried out to verify the theoretical results and demonstrate the effectiveness of the proposed synthetic ranking algorithm.  ( 2 min )

  • Open

    How can insurance companies leverage Generative AI?
    submitted by /u/u_dev_92 [link] [comments]  ( 52 min )
    Robo Birds from an 80ies Innovation Lab (ProtogenX34)
    submitted by /u/tebjan [link] [comments]  ( 52 min )
    The equivalent of Midjourney or Dall-E for MUSIC appears : what are your prompts ?
    Let's say an insanely powerful and high res music AI is released. It can create absolutely whatever you want. What would you ask it ? What would be your prompts ? submitted by /u/paulbismuth31 [link] [comments]  ( 52 min )
    AI Dream 140 - This EPIC AI Video will Haunt you - Part 4
    submitted by /u/LordPewPew777 [link] [comments]  ( 52 min )
    AI avatar generators struggle with my face
    I have been seeing a lot of AI generated avatars that are really cool and really accurate so I decided to try. I tried the tiktok one first and it looked actually nothing like me. I then downloaded the app wonder which allows you to upload 10 photos and get 100 pictures and a vast majority of them still didn't look like me. Have other people experienced this? Does anyone know reasons this might be? submitted by /u/dreyhitz27 [link] [comments]  ( 52 min )
    Made a vid talking about the future of AI and how to adapt to it B )
    https://www.youtube.com/watch?v=DUi7v0eIPz4&lc=Ugxuw-xuUSYvVnbfjrV4AaABAg&ab_channel=ScaleAI submitted by /u/Spiritual-Ad4430 [link] [comments]  ( 52 min )
    should i be worried?
    submitted by /u/jpclp [link] [comments]  ( 53 min )
    Google "Muse" generates high-quality AI images at record speed
    submitted by /u/Number_5_alive [link] [comments]  ( 53 min )
    An AI made debate between Karl Marx and Donald Trump.
    submitted by /u/omnisvosscio [link] [comments]  ( 52 min )
    Harry Styles - As It Was (AI video by Aiplague) 4K
    submitted by /u/nalr00n [link] [comments]  ( 52 min )
    Nvidia’s AI tech can upscale videos on the browser itself
    submitted by /u/qptbook [link] [comments]  ( 54 min )
    Get ready for Bing Search with ChatGPT!
    submitted by /u/liquidocelotYT [link] [comments]  ( 61 min )
    What happened in AI research in 2022 - My curated list of AI breakthroughs with a video explanation, article, and code for each paper
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 52 min )
    Will you consider having an AI robot for your friend / lover ? 🤖👭👬👫
    submitted by /u/Friendly_Avalanche [link] [comments]  ( 55 min )
    Which tent tempts the tent witches intentions
    submitted by /u/turnsouttellyouwhat [link] [comments]  ( 53 min )
    I asked AI to make a music video, this is what it made!
    submitted by /u/Branbruce [link] [comments]  ( 53 min )
    How to revoke decisions in ML-EDM ? (video #6)
    submitted by /u/ML-EDM [link] [comments]  ( 57 min )
    AI that automates repetitive tasks in your browser. Enter a task and it controls the browser to carry it out for you. superflows.ai
    submitted by /u/Quackerooney [link] [comments]  ( 58 min )
    🔥 Top A.I. Newsletters of 2023 🧠🐱‍💻🤖🚗🚀
    submitted by /u/BackgroundResult [link] [comments]  ( 63 min )
    AI dev, how should I begin?
    Im a 17 yo in Singapore taking CS in JC, have a background in competitive programming in cpp, coded some discord bots and done some CTFs etc. When I look for online courses for AI devs they seem to be teaching how to use certain pre-built/trained models like tensor and pytorch etc. Would like to ask if any senior AI devs here start the same way (from online courses, basics of tensor/pytorch,..) or what are some better ways to start with AI dev? Also, is an AI's performance depends more on its code/model or the amount of training data? (as a solo-dev, i want to know if i should focus more on writing better code or sorting better training sets) thanks and have a nice day submitted by /u/LampardNK [link] [comments]  ( 62 min )
    OpenAI's ChatGPT set to enhance Microsoft's Bing search capabilities
    submitted by /u/much_successes [link] [comments]  ( 57 min )
    Microsoft Working On ChatGPT-Powered Bing To Challenge Google
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 52 min )
    🔎 Breaking: Bing will have ChatGPT Soon!
    submitted by /u/BackgroundResult [link] [comments]  ( 57 min )
  • Open

    [D] Could neural networks just be like real neurons, but on a different medium? What's the difference between neurons in a brain and neurons in a computer?
    I'm not sure how to articulate this question but it's something i've been thinking too much about submitted by /u/WaggleMcDaggle [link] [comments]  ( 60 min )
    [D] ML in non-tech fields
    Hi all! What are some interesting applications of machine learning in fields that are not typically associated with technology? For example, have there been any successful projects using ML in fields like sociology, psychology, or political science? If so, could you share some links or references? Looking forwards to discovering your use-cases! submitted by /u/fr4nl4u [link] [comments]  ( 61 min )
    [N] Legal NLP Dataset With Over 39,000 Examples Released
    Legal datasets are extremely expensive because lawyers are, and this has bottlenecked legal NLP. To address this, we release the Merger Agreement Understand Dataset (MAUD), with over 39,000 multiple-choice reading comprehension examples for 152 merger agreements that have been manually labeled by legal experts. The dataset was created with the help of the American Bar Association; without their help the dataset would have cost over $5,000,000 to create. MAUD has substantial room for improvement and can could serve as a research challenge for NLP researchers without any legal background. Dataset and Baselines: https://github.com/TheAtticusProject/maud/ Paper: https://arxiv.org/abs/2301.00876 submitted by /u/Sea-Connection462 [link] [comments]  ( 59 min )
    [P] T5 Implementation in PyTorch
    An open-source implementation of Google AI's T5 in PyTorch. This repository contains the architecture to train your own T5 model. Link to the repository: https://github.com/conceptofmind/t5-pytorch T5 was first presented in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Colin Raffel. You can find a link to the paper here: https://arxiv.org/abs/1910.10683 Lucidrains was kind enough to provide feedback and review for this implementation. Please be sure to follow and support his work: https://github.com/lucidrains You can find the official T5x repository by Google AI here: https://github.com/google-research/t5x submitted by /u/EnricoShippole [link] [comments]  ( 67 min )
    [D] Feature engineering book recommendation for Bioinformatics
    Hi All, What book would you recommend for feature engineering, especially for bioinformatics and structural biology? Happy New Year and thanks in advance. submitted by /u/hovo1990 [link] [comments]  ( 60 min )
    [Discussion] If ML is based on data generated by humans, can it truly outperform humans?
    Hello mates, Since I have hardly any background in ML, I have somewhat dummy question. My understanding is that the majority of ML is based heavily on inputs generated by humans (some exceptions here would be unsupervised learning and GANs). So, if this is a case, I wonder if ML can truly outperform humans. Of course, in certain areas, like speed of computation or accuracy, computers will also be better than humans, but I am more interested in, shall we say, more general case. Kind regards submitted by /u/groman434 [link] [comments]  ( 85 min )
    [R] Measuring similarity between different vectors using Mahalanobis distance
    Hi guys, I have a set of feature values defined as x = {f_1,f_2,...,f_n} (x does not contain any zero) and the goal is to measure the similarity between these features using Mahalanobis distance so x is converted to a diagonal matrix called X_i where the diagonal elements are f_1,f_2,...,f_n, therefore, the distance is measured using columns of X_i. Then I calculate the covariance matrix of X_i which is semi-positive definite (SPD) but the inverse of the covariance matrix is non-SPD and Mahalanobis distance is not valid(it became negative). Any ideas or suggestions? Thanks. submitted by /u/eiliya_20 [link] [comments]  ( 65 min )
    [Discussion]: Quantization in native pytorch for GPUs (Cuda)?
    Is there a way to do quantization in native pytorch for GPUs (Cuda)? I know that TensorRT offers this functionality, but I would prefer working with native pytorch code. I understand from the pytorch docs, https://pytorch.org/docs/stable/quantization.html, that quantization for the GPU is linked to TensorRT. Given that Nvidia GPUs offer quantization for some time now, it's find it difficult to believe that no other solid implementation for quantization other than TensorRT exists. Grateful for any pointer or suggestions. submitted by /u/faschu [link] [comments]  ( 59 min )
    [P] 🗣️ Speechbox - A new library to *unnormalize* your speech.
    Speechbox is built on the premise that Whisper is good enough to pretty much transcribe any English speech. Furthermore, Whisper was trained to predict punctuated and orthographic text. ​ Speechbox leverages Whisper's quality to "unnormalize" audio transcriptions (see examples below) to make them more useful for further downstream applications while guaranteeing that the exact same words are being used. "we are going to the san francisco beach" can have multiple meanings: => We are going to the San Francisco beach! We are going to the San Francisco beach? We are going to the San Francisco beach. ​ Speechbox will pick the correct one for you 😉 ​ 👉 GitHub: https://github.com/huggingface/speechbox 🤗 Demo: https://huggingface.co/spaces/speechbox/whisper-restore-punctuation submitted by /u/pvp239 [link] [comments]  ( 60 min )
    [P] 🚀 AWS launches Fortuna, an open-source library for Uncertainty Quantification
    At AWS we released Fortuna, a library for Uncertainty Quantification. Fortuna supports conformal prediction, Bayesian inference methods and more. Try it out! GitHub stars are very welcome!!! ⭐⭐⭐ Github repo: https://github.com/awslabs/fortuna submitted by /u/gianluca_detommaso [link] [comments]  ( 63 min )
    [R] Is there a one-minute stock market dataset available?
    Found this https://data.nasdaq.com/databases/AS500/data, but it’s $1,000 a year. Anything free? Doesn’t have to be one-minute, just granular submitted by /u/side-8182 [link] [comments]  ( 61 min )
    [P] HF Spaces demo: Crosslingual youtube subtitles to videos with Whisper and DeepL
    Hi, As part of Huggingface whisper finetuning event I created a demo where you can: Download youtube video with a given URL 2. Watch downloaded video in the first video component 3. Run automatic speech recognition on the video using Whisper models from ggerganov https://github.com/ggerganov/whisper.cpp 4. Translate the recognized transcriptions to 26 languages supported by deepL Download generated subtitle files in .srt and .vtt formats 6. Watch the video in another video component with added subtitles You can test it from here --> https://huggingface.co/spaces/RASMUS/Whisper-youtube-crosslingual-subtitles <-- submitted by /u/Finslayer [link] [comments]  ( 66 min )
    [D]There is still no discussion nor response under my ICLR submission after two months. What do you think I should do?
    I have tried to post reminders to both chairs and reviewers, but no one seems to care about my rebuttal and revision. What an exciting experience! submitted by /u/minogame [link] [comments]  ( 65 min )
    [D] Tensorflow v1.15 in google colab not working
    Tensorflow 1.x is no longer supported on google colab. I want to use create_hparams from hparams and it has contrib module that has been removed in Tensorflow 2.x What shall be the alternative to this? submitted by /u/Affectionate_Bite_50 [link] [comments]  ( 64 min )
    [P] liboai - A C++ Library for OpenAI
    I've developed a library for the broader C++ developer and machine learning community to access and make use of the OpenAI API. The library follows a simple, elegant syntax similar in style to that of the OpenAI Python library. The library offers access to each component of the API, be it from Images, Fine-tunes, or Embeddings, it can do it--and with ease. It also possesses utility functionality that allows for out-of-box downloading of generated images, setting of request proxies, and so on. Finally, it contains highly-detailed documentation and code examples, as well as emphasizing secure usage of authorization information in the library. You can find it here: https://github.com/D7EAD/liboai. Enjoy! submitted by /u/throwaway1322482942 [link] [comments]  ( 64 min )
    State of the art models for Question generation [D]
    Hello everyone, I am working on a task to generate questions on some documents, currently manual creation of questions by domain experts is a very time consuming process, could anyone please suggest some good NLP models which can generate human like questions given a text. submitted by /u/arush1836 [link] [comments]  ( 58 min )
    [R] Issues Training CNN To Output Index To Large Array
    I created my own standard CNN. It can run on either C# or compute shader(GPU) in Unity. I am an experienced coder but I am still new to neural networks but I coded my own to help build my understanding of them. The Model: I am attempting to train the network to complete sentences of given text. (this may be the wrong approach). I am training the network with film dialog. I first sort through the dialog and convert the words into indexes for a list of all words in the dataset. The first input layer is 50 (Max of 50 words from user) and the output is just one index which is supposed to be the predicted word. The word indexes are normalized to be from 0-1 based on max word value. Training goes through every sentence word by word, and tells the network the next word after as the answer. The Problem: This is working well in small scale tests, but exponentially falls apart in larger scale models. The reason is that the more total words there are, the more exponentially sensitive the normalized indexes become since it's in the range of 0-1. Which makes training the network near impossible once there are thousands of words. Is there some simple solution I am missing when it comes to outputing sensitive index values? Any helps/ideas would be greatly appreciated. submitted by /u/TheRPGGamerMan [link] [comments]  ( 63 min )
  • Open

    RL Course Project ideas
    Hi guys, I have taken a course (undergrad) on Applied Machine Learning this semester and we have to do a project as part of our course. Our professor has suggested that we do something related to DL/RL since these 2 are quite active these days in research. Can someone please suggest some nice ideas/directions? I am not an absolute beginner in ML/RL and do know basic RL theory and pytorch. Our semester ends by May so a project should be doable in around 3 months. So I'm hoping that the project should be a good learning experience but also good enough to showcase my skills beyond the course (like in my CV and stuff, or possibly some research publication after working on it further). Thanks in advance! submitted by /u/6obama_bin_laden9 [link] [comments]  ( 56 min )
    Let’s learn about Policy Gradient by implementing our first Deep Reinforcement Learning algorithm with PyTorch (Deep Reinforcement Learning Free Course by Hugging Face 🤗)
    Hey there! I’m happy to announce that we just published the fourth Unit of the Deep Reinforcement Learning Course) 🥳 In this Unit, you’ll learn about Policy-based methods and code your first Deep Reinforcement Learning algorithm from scratch using PyTorch 🔥 You’ll then train this agent to play PixelCopter 🚁 and CartPole. You’ll be then able to improve the implementation with Convolutional Neural Networks. Start Learning now 👉 https://huggingface.co/deep-rl-course/unit4/introduction https://preview.redd.it/f7wpgv0g82aa1.jpg?width=1920&format=pjpg&auto=webp&s=09320316c02da83610e2c75ce18ae76a6c1df301 New year, new resolutions, if you want to start to learn about reinforcement learning, we launched this course, and don’t worry there’s still time and 2023 is the perfect year to start. We wrote an introduction unit to help you get started. You can start learning now 👉 https://huggingface.co/deep-rl-course/unit0/introduction If you have questions or feedback I would love to answer them. submitted by /u/cranthir_ [link] [comments]  ( 58 min )
    University researching in RL
    I am a Master's student looking for my Erasmus program and I would like to go to a university that does research in the field (I am currently a research assistant at Polimi). It's impossible for me to check every available university for the exchange so I would like to know if you could suggest some universities to look out for. (There are a lot of programs in the EU, some in China and Asia in general, and some in South America, just one in the USA with Florida university, but if you have any idea please tell me) Thanks! submitted by /u/Mysterious-Ad-5721 [link] [comments]  ( 56 min )
    variable action space in test scenario - how to deal with this
    Hi all, I am carrying out an optimisation problem on an RL task on a graph network. the observation space is a node embedding vector of n-dimensions. The reason for this choice is to be able to make the task invariant to graph size at test time. The action space is the number of nodes on the graph (it is a node-removal optimisation task so it can decide on any node to remove on the graph). The issue of course is that at test time, i might be faced with a graph that is either bigger or smaller and thus have an action space that has now changed. I was wondering how people may suggest to overcome this when making the original training phase invariant to the size of the graph? One idea could be to make it such that the environment 'presents' a node and the agent has to decide on whether it should remove or leave alone - however this seems laborious and i would like for the agent to be able to look at the graph and make an immediate decision on which action it takes. I have read several publications of this sort of approach with the advantage being that the task can now be applied to any graph size, but I am having a brain freeze abut how the action space is managed when the agent is presented with a graph of a size it has never seen before. Any advise or help is appreciated. thank you! submitted by /u/amjass12 [link] [comments]  ( 61 min )
    Distribution of actions - exploration distribution connection.
    For an environment I am currently working with I have two non-RL benchmark policies - a rule-based one and an modell-predictive one. The distributions of actions these two policies produce are looking like exponential distributions (see Picture). ​ https://preview.redd.it/2x5n2gn3d0aa1.png?width=578&format=png&auto=webp&s=892975a08a669de14ced228937175361f406c7e9 This made me wondering.. Is there any connection between the selected distribution for exploration (in PPO and SAC for eaxmple) and the desired distribution of actions? Are there any assumption about the distribution of actions when using a Normal Distribution for exploration? Because intuitively I would assume with a Normal Distribution for exploration actions greater and less than mean of the normal should lead to somewhat equal rewards. Otherwise choosing a normal distribution for exploration wouldn't make much sense. Also I suspect given an action distribution like the one in the picture the agent will not be able to learn is unless the standard deviation of the normal distribution becomes very small. What should I do in a case like this? I was thinking about doing a transformation of the action space so the actions are distributed more evenly. Then learn these actions with linear output + normal distribution and do the transformation as part of the environment. Neural Network -> Linear -> Normal -> Transformation -> Environment Instead of : Neural Network -> Tanh -> Normal -> Environment like it's usually done. submitted by /u/flxh13 [link] [comments]  ( 55 min )
  • Open

    Lights! Cameras! Atoms! Scientist Peers Into the Quantum Future
    Editor’s note: This is part of a series profiling people advancing science with high performance computing. Ryan Coffee makes movies of molecules. Their impacts are huge. The senior scientist at the SLAC National Accelerator Laboratory (above) says these visualizations could unlock the secrets of photosynthesis. They’ve already shown how sunlight can cause skin cancer. Long Read article >  ( 6 min )
    UF Provost Joe Glover on Building a Leading AI University
    When NVIDIA co-founder Chris Malachowsky approached University of Florida Provost Joe Glover with the offer of an AI supercomputer, he couldn’t have predicted the transformative impact it would have on the university. In just a short time, UF has become one of the top public colleges in the U.S. and developed a groundbreaking neural network Read article >  ( 5 min )
  • Open

    Strengthening electron-triggered light emission
    A new method can produce a hundredfold increase in light emissions from a type of electron-photon coupling, which is key to electron microscopes and other technologies.  ( 9 min )
  • Open

    Unlocking the power of the ChatGPT revolution: 100 innovative use-cases to try before you …
    2023 UPDATE! Just published a book with 1337 use cases and around 4000 examples.  ( 157 min )
    Python or JavaScript! for ML and DL
    Choose your Programming Language  ( 12 min )
    ChatGPT — CICERO — Twitter Layoffs (Autumn 22)
    The EdgeRunner Agent #Issue 1 Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 22 min )
    Best AI Tools with a Christmas Spirit
    Robots celebrate holidays?  ( 6 min )
  • Open

    MIXRTs: Toward Interpretable Multi-Agent Reinforcement Learning via Mixing Recurrent Soft Decision Trees. (arXiv:2209.07225v2 [cs.LG] UPDATED)
    Multi-agent reinforcement learning (MARL) recently has achieved tremendous success in a wide range of fields. However, with a black-box neural network architecture, existing MARL methods make decisions in an opaque fashion that hinders humans from understanding the learned knowledge and how input observations influence decisions. Our solution is MIXing Recurrent soft decision Trees (MIXRTs), a novel interpretable architecture that can represent explicit decision processes via the root-to-leaf path of decision trees. We introduce a novel recurrent structure in soft decision trees to address partial observability, and estimate joint action values via linearly mixing outputs of recurrent trees based on local observations only. Theoretical analysis shows that MIXRTs guarantees the structural constraint with additivity and monotonicity in factorization. We evaluate MIXRTs on a range of challenging StarCraft II tasks. Experimental results show that our interpretable learning framework obtains competitive performance compared to widely investigated baselines, and delivers more straightforward explanations and domain knowledge of the decision processes.  ( 2 min )
    Large-Scale Traffic Signal Control by a Nash Deep Q-network Approach. (arXiv:2301.00637v1 [cs.GT])
    Reinforcement Learning (RL) is currently one of the most commonly used techniques for traffic signal control (TSC), which can adaptively adjusted traffic signal phase and duration according to real-time traffic data. However, a fully centralized RL approach is beset with difficulties in a multi-network scenario because of exponential growth in state-action space with increasing intersections. Multi-agent reinforcement learning (MARL) can overcome the high-dimension problem by employing the global control of each local RL agent, but it also brings new challenges, such as the failure of convergence caused by the non-stationary Markov Decision Process (MDP). In this paper, we introduce an off-policy nash deep Q-Network (OPNDQN) algorithm, which mitigates the weakness of both fully centralized and MARL approaches. The OPNDQN algorithm solves the problem that traditional algorithms cannot be used in large state-action space traffic models by utilizing a fictitious game approach at each iteration to find the nash equilibrium among neighboring intersections, from which no intersection has incentive to unilaterally deviate. One of main advantages of OPNDQN is to mitigate the non-stationarity of multi-agent Markov process because it considers the mutual influence among neighboring intersections by sharing their actions. On the other hand, for training a large traffic network, the convergence rate of OPNDQN is higher than that of existing MARL approaches because it does not incorporate all state information of each agent. We conduct an extensive experiments by using Simulation of Urban MObility simulator (SUMO), and show the dominant superiority of OPNDQN over several existing MARL approaches in terms of average queue length, episode training reward and average waiting time.  ( 2 min )
    Fundamental Laws of Binary Classification. (arXiv:2205.07589v2 [cs.LG] UPDATED)
    Finding discriminant functions of minimum risk binary classification systems is a novel geometric locus problem -- which requires solving a system of fundamental locus equations of binary classification -- subject to deep-seated statistical laws. We show that a discriminant function of a minimum risk binary classification system is the solution of a locus equation that represents the geometric locus of the decision boundary of the system, wherein the discriminant function is connected to the decision boundary by an exclusive principal eigen-coordinate system -- at which point the discriminant function is represented by a geometric locus of a novel principal eigenaxis -- structured as a dual locus of likelihood components and principal eigenaxis components. We demonstrate that a minimum risk binary classification system acts to jointly minimize its eigenenergy and risk by locating a point of equilibrium, at which point critical minimum eigenenergies exhibited by the system are symmetrically concentrated in such a manner that the novel principal eigenaxis of the system exhibits symmetrical dimensions and densities, so that counteracting and opposing forces and influences of the system are symmetrically balanced with each other -- about the geometric center of the locus of the novel principal eigenaxis -- whereon the statistical fulcrum of the system is located. Thereby, a minimum risk binary classification system satisfies a state of statistical equilibrium -- so that the total allowed eigenenergy and the expected risk exhibited by the system are jointly minimized within the decision space of the system -- at which point the system exhibits the minimum probability of classification error.  ( 3 min )
    Heterogeneous Graph Contrastive Multi-view Learning. (arXiv:2210.00248v2 [cs.LG] UPDATED)
    Inspired by the success of contrastive learning (CL) in computer vision and natural language processing, graph contrastive learning (GCL) has been developed to learn discriminative node representations on graph datasets. However, the development of GCL on Heterogeneous Information Networks (HINs) is still in the infant stage. For example, it is unclear how to augment the HINs without substantially altering the underlying semantics, and how to design the contrastive objective to fully capture the rich semantics. Moreover, early investigations demonstrate that CL suffers from sampling bias, whereas conventional debiasing techniques are empirically shown to be inadequate for GCL. How to mitigate the sampling bias for heterogeneous GCL is another important problem. To address the aforementioned challenges, we propose a novel Heterogeneous Graph Contrastive Multi-view Learning (HGCML) model. In particular, we use metapaths as the augmentation to generate multiple subgraphs as multi-views, and propose a contrastive objective to maximize the mutual information between any pairs of metapath-induced views. To alleviate the sampling bias, we further propose a positive sampling strategy to explicitly select positives for each node via jointly considering semantic and structural information preserved on each metapath view. Extensive experiments demonstrate HGCML consistently outperforms state-of-the-art baselines on five real-world benchmark datasets.  ( 2 min )
    A Snapshot of the Frontiers of Client Selection in Federated Learning. (arXiv:2210.04607v2 [cs.DC] UPDATED)
    Federated learning (FL) has been proposed as a privacy-preserving approach in distributed machine learning. A federated learning architecture consists of a central server and a number of clients that have access to private, potentially sensitive data. Clients are able to keep their data in their local machines and only share their locally trained model's parameters with a central server that manages the collaborative learning process. FL has delivered promising results in real-life scenarios, such as healthcare, energy, and finance. However, when the number of participating clients is large, the overhead of managing the clients slows down the learning. Thus, client selection has been introduced as a strategy to limit the number of communicating parties at every step of the process. Since the early na\"{i}ve random selection of clients, several client selection methods have been proposed in the literature. Unfortunately, given that this is an emergent field, there is a lack of a taxonomy of client selection methods, making it hard to compare approaches. In this paper, we propose a taxonomy of client selection in Federated Learning that enables us to shed light on current progress in the field and identify potential areas of future research in this promising area of machine learning.  ( 2 min )
    Online Training Through Time for Spiking Neural Networks. (arXiv:2210.04195v2 [cs.NE] UPDATED)
    Spiking neural networks (SNNs) are promising brain-inspired energy-efficient models. Recent progress in training methods has enabled successful deep SNNs on large-scale tasks with low latency. Particularly, backpropagation through time (BPTT) with surrogate gradients (SG) is popularly used to achieve high performance in a very small number of time steps. However, it is at the cost of large memory consumption for training, lack of theoretical clarity for optimization, and inconsistency with the online property of biological learning and rules on neuromorphic hardware. Other works connect spike representations of SNNs with equivalent artificial neural network formulation and train SNNs by gradients from equivalent mappings to ensure descent directions. But they fail to achieve low latency and are also not online. In this work, we propose online training through time (OTTT) for SNNs, which is derived from BPTT to enable forward-in-time learning by tracking presynaptic activities and leveraging instantaneous loss and gradients. Meanwhile, we theoretically analyze and prove that gradients of OTTT can provide a similar descent direction for optimization as gradients based on spike representations under both feedforward and recurrent conditions. OTTT only requires constant training memory costs agnostic to time steps, avoiding the significant memory costs of BPTT for GPU training. Furthermore, the update rule of OTTT is in the form of three-factor Hebbian learning, which could pave a path for online on-chip learning. With OTTT, it is the first time that two mainstream supervised SNN training methods, BPTT with SG and spike representation-based training, are connected, and meanwhile in a biologically plausible form. Experiments on CIFAR-10, CIFAR-100, ImageNet, and CIFAR10-DVS demonstrate the superior performance of our method on large-scale static and neuromorphic datasets in small time steps.  ( 2 min )
    Automated dysgraphia detection by deep learning with SensoGrip. (arXiv:2210.07659v2 [cs.LG] UPDATED)
    Dysgraphia, a handwriting learning disability, has a serious negative impact on children's academic results, daily life and overall wellbeing. Early detection of dysgraphia allows for an early start of a targeted intervention. Several studies have investigated dysgraphia detection by machine learning algorithms using a digital tablet. However, these studies deployed classical machine learning algorithms with manual feature extraction and selection as well as binary classification: either dysgraphia or no dysgraphia. In this work, we investigated fine grading of handwriting capabilities by predicting SEMS score (between 0 and 12) with deep learning. Our approach provide accuracy more than 99% and root mean square error lower than one, with automatic instead of manual feature extraction and selection. Furthermore, we used smart pen called SensoGrip, a pen equipped with sensors to capture handwriting dynamics, instead of a tablet, enabling writing evaluation in more realistic scenarios.  ( 2 min )
    Spatial-Temporal Meta-path Guided Explainable Crime Prediction. (arXiv:2205.01901v3 [cs.LG] UPDATED)
    Exposure to crime and violence can harm individuals' quality of life and the economic growth of communities. In light of the rapid development in machine learning, there is a rise in the need to explore automated solutions to prevent crimes. With the increasing availability of both fine-grained urban and public service data, there is a recent surge in fusing such cross-domain information to facilitate crime prediction. By capturing the information about social structure, environment, and crime trends, existing machine learning predictive models have explored the dynamic crime patterns from different views. However, these approaches mostly convert such multi-source knowledge into implicit and latent representations (e.g., learned embeddings of districts), making it still a challenge to investigate the impacts of explicit factors for the occurrences of crimes behind the scenes. In this paper, we present a Spatial-Temporal Metapath guided Explainable Crime prediction (STMEC) framework to capture dynamic patterns of crime behaviours and explicitly characterize how the environmental and social factors mutually interact to produce the forecasts. Extensive experiments show the superiority of STMEC compared with other advanced spatiotemporal models, especially in predicting felonies (e.g., robberies and assaults with dangerous weapons).  ( 2 min )
    TriNet: stabilizing self-supervised learning from complete or slow collapse. (arXiv:2301.00656v1 [eess.AS])
    Self-supervised learning (SSL) models confront challenges of abrupt informational collapse or slow dimensional collapse. We propose TriNet, which introduces a novel triple-branch architecture for preventing collapse and stabilizing the pre-training. Our experimental results show that the proposed method notably stabilizes and accelerates pre-training and achieves a relative word error rate reduction (WERR) of 5.32% compared to the state-of-the-art (SOTA) Data2vec for a downstream benchmark ASR task. We will release our code at https://github.com/tencent-ailab/.  ( 2 min )
    Chains of Autoreplicative Random Forests for missing value imputation in high-dimensional datasets. (arXiv:2301.00595v1 [cs.LG])
    Missing values are a common problem in data science and machine learning. Removing instances with missing values can adversely affect the quality of further data analysis. This is exacerbated when there are relatively many more features than instances, and thus the proportion of affected instances is high. Such a scenario is common in many important domains, for example, single nucleotide polymorphism (SNP) datasets provide a large number of features over a genome for a relatively small number of individuals. To preserve as much information as possible prior to modeling, a rigorous imputation scheme is acutely needed. While Denoising Autoencoders is a state-of-the-art method for imputation in high-dimensional data, they still require enough complete cases to be trained on which is often not available in real-world problems. In this paper, we consider missing value imputation as a multi-label classification problem and propose Chains of Autoreplicative Random Forests. Using multi-label Random Forests instead of neural networks works well for low-sampled data as there are fewer parameters to optimize. Experiments on several SNP datasets show that our algorithm effectively imputes missing values based only on information from the dataset and exhibits better performance than standard algorithms that do not require any additional information. In this paper, the algorithm is implemented specifically for SNP data, but it can easily be adapted for other cases of missing value imputation.  ( 2 min )
    FlatENN: Train Flat for Enhanced Fault Tolerance of Quantized Deep Neural Networks. (arXiv:2301.00675v1 [cs.LG])
    Model compression via quantization and sparsity enhancement has gained an immense interest to enable the deployment of deep neural networks (DNNs) in resource-constrained edge environments. Although these techniques have shown promising results in reducing the energy, latency and memory requirements of the DNNs, their performance in non-ideal real-world settings (such as in the presence of hardware faults) is yet to be completely understood. In this paper, we investigate the impact of bit-flip and stuck-at faults on activation-sparse quantized DNNs (QDNNs). We show that a high level of activation sparsity comes at the cost of larger vulnerability to faults. For instance, activation-sparse QDNNs exhibit up to 17.32% lower accuracy than the standard QDNNs. We also establish that one of the major cause of the degraded accuracy is sharper minima in the loss landscape for activation-sparse QDNNs, which makes them more sensitive to perturbations in the weight values due to faults. Based on this observation, we propose the mitigation of the impact of faults by employing a sharpness-aware quantization (SAQ) training scheme. The activation-sparse and standard QDNNs trained with SAQ have up to 36.71% and 24.76% higher inference accuracy, respectively compared to their conventionally trained equivalents. Moreover, we show that SAQ-trained activation-sparse QDNNs show better accuracy in faulty settings than standard QDNNs trained conventionally. Thus the proposed technique can be instrumental in achieving sparsity-related energy/latency benefits without compromising on fault tolerance.  ( 2 min )
    Deep Recurrent Learning Through Long Short Term Memory and TOPSIS. (arXiv:2301.00693v1 [cs.SE])
    Enterprise resource planning (ERP) software brings resources, data together to keep software-flow within business processes in a company. However, cloud computing's cheap, easy and quick management promise pushes business-owners for a transition from monolithic to a data-center/cloud based ERP. Since cloud-ERP development involves a cyclic process, namely planning, implementing, testing and upgrading, its adoption is realized as a deep recurrent neural network problem. Eventually, a classification algorithm based on long short term memory (LSTM) and TOPSIS is proposed to identify and rank, respectively, adoption features. Our theoretical model is validated over a reference model by articulating key players, services, architecture, functionalities. Qualitative survey is conducted among users by considering technology, innovation and resistance issues, to formulate hypotheses on key adoption factors.  ( 2 min )
    Sparse neural networks with skip-connections for nonlinear system identification. (arXiv:2301.00582v1 [eess.SY])
    Data-driven models such as neural networks are being applied more and more to safety-critical applications, such as the modeling and control of cyber-physical systems. Despite the flexibility of the approach, there are still concerns about the safety of these models in this context, as well as the need for large amounts of potentially expensive data. In particular, when long-term predictions are needed or frequent measurements are not available, the open-loop stability of the model becomes important. However, it is difficult to make such guarantees for complex black-box models such as neural networks, and prior work has shown that model stability is indeed an issue. In this work, we consider an aluminum extraction process where measurements of the internal state of the reactor are time-consuming and expensive. We model the process using neural networks and investigate the role of including skip connections in the network architecture as well as using l1 regularization to induce sparse connection weights. We demonstrate that these measures can greatly improve both the accuracy and the stability of the models for datasets of varying sizes.  ( 2 min )
    An Efficient Hierarchical Kriging Modeling Method for High-dimension Multi-fidelity Problems. (arXiv:2301.00216v1 [cs.LG])
    Multi-fidelity Kriging model is a promising technique in surrogate-based design as it can balance the model accuracy and cost of sample preparation by fusing low- and high-fidelity data. However, the cost for building a multi-fidelity Kriging model increases significantly with the increase of the problem dimension. To attack this issue, an efficient Hierarchical Kriging modeling method is proposed. In building the low-fidelity model, the maximal information coefficient is utilized to calculate the relative value of the hyperparameter. With this, the maximum likelihood estimation problem for determining the hyperparameters is transformed as a one-dimension optimization problem, which can be solved in an efficient manner and thus improve the modeling efficiency significantly. A local search is involved further to exploit the search space of hyperparameters to improve the model accuracy. The high-fidelity model is built in a similar manner with the hyperparameter of the low-fidelity model served as the relative value of the hyperparameter for high-fidelity model. The performance of the proposed method is compared with the conventional tuning strategy, by testing them over ten analytic problems and an engineering problem of modeling the isentropic efficiency of a compressor rotor. The empirical results demonstrate that the modeling time of the proposed method is reduced significantly without sacrificing the model accuracy. For the modeling of the isentropic efficiency of the compressor rotor, the cost saving associated with the proposed method is about 90% compared with the conventional strategy. Meanwhile, the proposed method achieves higher accuracy.  ( 2 min )
    Targeted Phishing Campaigns using Large Scale Language Models. (arXiv:2301.00665v1 [cs.CL])
    In this research, we aim to explore the potential of natural language models (NLMs) such as GPT-3 and GPT-2 to generate effective phishing emails. Phishing emails are fraudulent messages that aim to trick individuals into revealing sensitive information or taking actions that benefit the attackers. We propose a framework for evaluating the performance of NLMs in generating these types of emails based on various criteria, including the quality of the generated text, the ability to bypass spam filters, and the success rate of tricking individuals. Our evaluations show that NLMs are capable of generating phishing emails that are difficult to detect and that have a high success rate in tricking individuals, but their effectiveness varies based on the specific NLM and training data used. Our research indicates that NLMs could have a significant impact on the prevalence of phishing attacks and emphasizes the need for further study on the ethical and security implications of using NLMs for malicious purposes.  ( 2 min )
    Lossy Compression with Gaussian Diffusion. (arXiv:2206.08889v2 [stat.ML] UPDATED)
    We consider a novel lossy compression approach based on unconditional diffusion generative models, which we call DiffC. Unlike modern compression schemes which rely on transform coding and quantization to restrict the transmitted information, DiffC relies on the efficient communication of pixels corrupted by Gaussian noise. We implement a proof of concept and find that it works surprisingly well despite the lack of an encoder transform, outperforming the state-of-the-art generative compression method HiFiC on ImageNet 64x64. DiffC only uses a single model to encode and denoise corrupted pixels at arbitrary bitrates. The approach further provides support for progressive coding, that is, decoding from partial bit streams. We perform a rate-distortion analysis to gain a deeper understanding of its performance, providing analytical results for multivariate Gaussian data as well as theoretic bounds for general distributions. Furthermore, we prove that a flow-based reconstruction achieves a 3 dB gain over ancestral sampling at high bitrates.  ( 2 min )
    Extended Feature Space-Based Automatic Melanoma Detection System. (arXiv:2209.04588v2 [cs.LG] UPDATED)
    Melanoma is the deadliest form of skin cancer. Uncontrollable growth of melanocytes leads to melanoma. Melanoma has been growing wildly in the last few decades. In recent years, the detection of melanoma using image processing techniques has become a dominant research field. The Automatic Melanoma Detection System (AMDS) helps to detect melanoma based on image processing techniques by accepting infected skin area images as input. A single lesion image is a source of multiple features. Therefore, It is crucial to select the appropriate features from the image of the lesion in order to increase the accuracy of AMDS. For melanoma detection, all extracted features are not important. Some of the extracted features are complex and require more computation tasks, which impacts the classification accuracy of AMDS. The feature extraction phase of AMDS exhibits more variability, therefore it is important to study the behaviour of AMDS using individual and extended feature extraction approaches. A novel algorithm ExtFvAMDS is proposed for the calculation of Extended Feature Vector Space. The six models proposed in the comparative study revealed that the HSV feature vector space for automatic detection of melanoma using Ensemble Bagged Tree classifier on Med-Node Dataset provided 99% AUC, 95.30% accuracy, 94.23% sensitivity, and 96.96% specificity.  ( 2 min )
    UltraProp: Principled and Explainable Propagation on Large Graphs. (arXiv:2301.00270v1 [cs.SI])
    Given a large graph with few node labels, how can we (a) identify the mixed network-effect of the graph and (b) predict the unknown labels accurately and efficiently? This work proposes Network Effect Analysis (NEA) and UltraProp, which are based on two insights: (a) the network-effect (NE) insight: a graph can exhibit not only one of homophily and heterophily, but also both or none in a label-wise manner, and (b) the neighbor-differentiation (ND) insight: neighbors have different degrees of influence on the target node based on the strength of connections. NEA provides a statistical test to check whether a graph exhibits network-effect or not, and surprisingly discovers the absence of NE in many real-world graphs known to have heterophily. UltraProp solves the node classification problem with notable advantages: (a) Accurate, thanks to the network-effect (NE) and neighbor-differentiation (ND) insights; (b) Explainable, precisely estimating the compatibility matrix; (c) Scalable, being linear with the input size and handling graphs with millions of nodes; and (d) Principled, with closed-form formula and theoretical guarantee. Applied on eight real-world graph datasets, UltraProp outperforms top competitors in terms of accuracy and run time, requiring only stock CPU servers. On a large real-world graph with 1.6M nodes and 22.3M edges, UltraProp achieves more than 9 times speedup (12 minutes vs. 2 hours) compared to most competitors.  ( 2 min )
    ReSQueing Parallel and Private Stochastic Convex Optimization. (arXiv:2301.00457v1 [math.OC])
    We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Stochastic Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density. Combining ReSQue with recent advances in ball oracle acceleration [CJJJLST20, ACJJS21], we develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings. For a SCO objective constrained to the unit ball in $\mathbb{R}^d$, we obtain the following results (up to polylogarithmic factors). We give a parallel algorithm obtaining optimization error $\epsilon_{\text{opt}}$ with $d^{1/3}\epsilon_{\text{opt}}^{-2/3}$ gradient oracle query depth and $d^{1/3}\epsilon_{\text{opt}}^{-2/3} + \epsilon_{\text{opt}}^{-2}$ gradient queries in total, assuming access to a bounded-variance stochastic gradient estimator. For $\epsilon_{\text{opt}} \in [d^{-1}, d^{-1/4}]$, our algorithm matches the state-of-the-art oracle depth of [BJLLS19] while maintaining the optimal total work of stochastic gradient descent. We give an $(\epsilon_{\text{dp}}, \delta)$-differentially private algorithm which, given $n$ samples of Lipschitz loss functions, obtains near-optimal optimization error and makes $\min(n, n^2\epsilon_{\text{dp}}^2 d^{-1}) + \min(n^{4/3}\epsilon_{\text{dp}}^{1/3}, (nd)^{2/3}\epsilon_{\text{dp}}^{-1})$ queries to the gradients of these functions. In the regime $d \le n \epsilon_{\text{dp}}^{2}$, where privacy comes at no cost in terms of the optimal loss up to constants, our algorithm uses $n + (nd)^{2/3}\epsilon_{\text{dp}}^{-1}$ queries and improves recent advancements of [KLL21, AFKT21]. In the moderately low-dimensional setting $d \le \sqrt n \epsilon_{\text{dp}}^{3/2}$, our query complexity is near-linear.  ( 2 min )
    Reinforcement Learning-Based Cooperative P2P Power Trading between DC Nanogrid Clusters with Wind and PV Energy Resources. (arXiv:2209.07744v2 [cs.LG] UPDATED)
    In replacing fossil fuels with renewable energy resources for carbon neutrality, the unbalanced resource production of intermittent wind and photovoltaic (PV) power is a critical issue for peer-to-peer (P2P) power trading. To address this issue, a reinforcement learning (RL) technique is introduced in this paper. For RL, a graph convolutional network (GCN) and a bi-directional long short-term memory (Bi-LSTM) network are jointly applied to P2P power trading between nanogrid clusters, based on cooperative game theory. The flexible and reliable DC nanogrid is suitable for integrating renewable energy for a distribution system. Each local nanogrid cluster takes the position of prosumer, focusing on power production and consumption simultaneously. For the power management of nanogrid cluster, multi-objective optimization is applied to each local nanogrid cluster with the Internet of Things (IoT) technology. Charging/discharging of an electric vehicle (EV) is executed considering the intermittent characteristics of wind and PV power production. RL algorithms, such as GCN- convolutional neural network (CNN) layers for deep Q-learning network (DQN), GCN-LSTM layers for deep recurrent Q-learning network (DRQN), GCN-Bi-LSTM layers for DRQN, and GCN-Bi-LSTM layers for proximal policy optimization (PPO), are used for simulations. Consequently, the cooperative P2P power trading system maximizes the profit by considering the time of use (ToU) tariff-based electricity cost and the system marginal price (SMP), and minimizes the amount of grid power consumption. Power management of nanogrid clusters with P2P power trading is simulated on a distribution test feeder in real time, and the proposed GCN-Bi-LSTM-PPO technique achieving the lowest electricity cost among the RL algorithms used for comparison reduces the electricity cost by 36.7%, averaging over nanogrid clusters.  ( 3 min )
    Predicting Cellular Responses to Novel Drug Perturbations at a Single-Cell Resolution. (arXiv:2204.13545v2 [cs.LG] UPDATED)
    Single-cell transcriptomics enabled the study of cellular heterogeneity in response to perturbations at the resolution of individual cells. However, scaling high-throughput screens (HTSs) to measure cellular responses for many drugs remains a challenge due to technical limitations and, more importantly, the cost of such multiplexed experiments. Thus, transferring information from routinely performed bulk RNA HTS is required to enrich single-cell data meaningfully. We introduce chemCPA, a new encoder-decoder architecture to study the perturbational effects of unseen drugs. We combine the model with an architecture surgery for transfer learning and demonstrate how training on existing bulk RNA HTS datasets can improve generalisation performance. Better generalisation reduces the need for extensive and costly screens at single-cell resolution. We envision that our proposed method will facilitate more efficient experiment designs through its ability to generate in-silico hypotheses, ultimately accelerating drug discovery.  ( 2 min )
    DRSOM: A Dimension Reduced Second-Order Method. (arXiv:2208.00208v2 [math.OC] UPDATED)
    In this paper, we propose a Dimension-Reduced Second-Order Method (DRSOM) for convex and nonconvex (unconstrained) optimization. Under a trust-region-like framework, our method preserves the convergence of the second-order method while using only curvature information in a few directions. Consequently, the computational overhead of our method remains comparable to the first-order such as the gradient descent method. Theoretically, we show that the method has a local quadratic convergence and a global convergence rate of $O(\epsilon^{-3/2})$ to satisfy the first-order and second-order conditions if the subspace satisfies a commonly adopted approximated Hessian assumption. We further show that this assumption can be removed if we perform one \emph{corrector step} (using a Krylov method, for example) periodically at the end stage of the algorithm. The applicability and performance of DRSOM are exhibited by various computational experiments, particularly in machine learning and deep learning. For neural networks, our preliminary implementation seems to gain computational advantages in terms of training accuracy and iteration complexity over state-of-the-art first-order methods such as SGD and ADAM.  ( 2 min )
    Learning To Rank Diversely. (arXiv:2210.07774v2 [cs.IR] UPDATED)
    Airbnb is a two-sided marketplace, bringing together hosts who own listings for rent, with prospective guests from around the globe. Applying neural network-based learning to rank techniques has led to significant improvements in matching guests with hosts. These improvements in ranking were driven by a core strategy: order the listings by their estimated booking probabilities, then iterate on techniques to make these booking probability estimates more and more accurate. Embedded implicitly in this strategy was an assumption that the booking probability of a listing could be determined independently of other listings in search results. In this paper we discuss how this assumption, pervasive throughout the commonly-used learning to rank frameworks, is false. We provide a theoretical foundation correcting this assumption, followed by efficient neural network architectures based on the theory. Explicitly accounting for possible similarities between listings, and reducing them to diversify the search results generated strong positive impact. We discuss these metric wins as part of the online A/B tests of the theory. Our method provides a practical way to diversify search results for large-scale production ranking systems.  ( 2 min )
    A Genetic Algorithm-based Framework for Learning Statistical Power Manifold. (arXiv:2209.00215v2 [stat.CO] UPDATED)
    Statistical power is a measure of the replicability of a categorical hypothesis test. Formally, it is the probability of detecting an effect, if there is a true effect present in the population. Hence, optimizing statistical power as a function of some parameters of a hypothesis test is desirable. However, for most hypothesis tests, the explicit functional form of statistical power for individual model parameters is unknown; but calculating power for a given set of values of those parameters is possible using simulated experiments. These simulated experiments are usually computationally expensive. Hence, developing the entire statistical power manifold using simulations can be very time-consuming. We propose a novel genetic algorithm-based framework for learning statistical power manifolds. For a multiple linear regression $F$-test, we show that the proposed algorithm/framework learns the statistical power manifold much faster as compared to a brute-force approach as the number of queries to the power oracle is significantly reduced. We also show that the quality of learning the manifold improves as the number of iterations increases for the genetic algorithm. Such tools are useful for evaluating statistical power trade-offs when researchers have little information regarding a priori `best guesses' of primary effect sizes of interest or how sampling variability in non-primary effects impacts power for primary ones.  ( 2 min )
    Object Representations as Fixed Points: Training Iterative Refinement Algorithms with Implicit Differentiation. (arXiv:2207.00787v3 [cs.LG] UPDATED)
    Iterative refinement -- start with a random guess, then iteratively improve the guess -- is a useful paradigm for representation learning because it offers a way to break symmetries among equally plausible explanations for the data. This property enables the application of such methods to infer representations of sets of entities, such as objects in physical scenes, structurally resembling clustering algorithms in latent space. However, most prior works differentiate through the unrolled refinement process, which can make optimization challenging. We observe that such methods can be made differentiable by means of the implicit function theorem, and develop an implicit differentiation approach that improves the stability and tractability of training by decoupling the forward and backward passes. This connection enables us to apply advances in optimizing implicit layers to not only improve the optimization of the slot attention module in SLATE, a state-of-the-art method for learning entity representations, but do so with constant space and time complexity in backpropagation and only one additional line of code.  ( 2 min )
    Theory of Machine Learning with Limited Data. (arXiv:2206.07586v3 [cs.AI] UPDATED)
    Application of machine learning may be understood as deriving new knowledge for practical use through explaining accumulated observations, training set. Peirce used the term abduction for this kind of inference. Here I formalize the concept of abduction for real valued hypotheses, and show that 14 of the most popular textbook ML learners (every learner I tested), covering classification, regression and clustering, implement this concept of abduction inference. The approach is proposed as an alternative to statistical learning theory, which requires an impractical assumption of indefinitely increasing training set for its justification.  ( 2 min )
    Tree ensemble kernels for Bayesian optimization with known constraints over mixed-feature spaces. (arXiv:2207.00879v3 [stat.ML] UPDATED)
    Tree ensembles can be well-suited for black-box optimization tasks such as algorithm tuning and neural architecture search, as they achieve good predictive performance with little or no manual tuning, naturally handle discrete feature spaces, and are relatively insensitive to outliers in the training data. Two well-known challenges in using tree ensembles for black-box optimization are (i) effectively quantifying model uncertainty for exploration and (ii) optimizing over the piece-wise constant acquisition function. To address both points simultaneously, we propose using the kernel interpretation of tree ensembles as a Gaussian Process prior to obtain model variance estimates, and we develop a compatible optimization formulation for the acquisition function. The latter further allows us to seamlessly integrate known constraints to improve sampling efficiency by considering domain-knowledge in engineering settings and modeling search space symmetries, e.g., hierarchical relationships in neural architecture search. Our framework performs as well as state-of-the-art methods for unconstrained black-box optimization over continuous/discrete features and outperforms competing methods for problems combining mixed-variable feature spaces and known input constraints.  ( 2 min )
    DM$^2$: Decentralized Multi-Agent Reinforcement Learning for Distribution Matching. (arXiv:2206.00233v2 [cs.MA] UPDATED)
    Current approaches to multi-agent cooperation rely heavily on centralized mechanisms or explicit communication protocols to ensure convergence. This paper studies the problem of distributed multi-agent learning without resorting to centralized components or explicit communication. It examines the use of distribution matching to facilitate the coordination of independent agents. In the proposed scheme, each agent independently minimizes the distribution mismatch to the corresponding component of a target visitation distribution. The theoretical analysis shows that under certain conditions, each agent minimizing its individual distribution mismatch allows the convergence to the joint policy that generated the target distribution. Further, if the target distribution is from a joint policy that optimizes a cooperative task, the optimal policy for a combination of this task reward and the distribution matching reward is the same joint policy. This insight is used to formulate a practical algorithm (DM$^2$), in which each individual agent matches a target distribution derived from concurrently sampled trajectories from a joint expert policy. Experimental validation on the StarCraft domain shows that combining (1) a task reward, and (2) a distribution matching reward for expert demonstrations for the same task, allows agents to outperform a naive distributed baseline. Additional experiments probe the conditions under which expert demonstrations need to be sampled to obtain the learning benefits.  ( 2 min )
    Multimodal Sequential Generative Models for Semi-Supervised Language Instruction Following. (arXiv:2301.00676v1 [cs.LG])
    Agents that can follow language instructions are expected to be useful in a variety of situations such as navigation. However, training neural network-based agents requires numerous paired trajectories and languages. This paper proposes using multimodal generative models for semi-supervised learning in the instruction following tasks. The models learn a shared representation of the paired data, and enable semi-supervised learning by reconstructing unpaired data through the representation. Key challenges in applying the models to sequence-to-sequence tasks including instruction following are learning a shared representation of variable-length mulitimodal data and incorporating attention mechanisms. To address the problems, this paper proposes a novel network architecture to absorb the difference in the sequence lengths of the multimodal data. In addition, to further improve the performance, this paper shows how to incorporate the generative model-based approach with an existing semi-supervised method called a speaker-follower model, and proposes a regularization term that improves inference using unpaired trajectories. Experiments on BabyAI and Room-to-Room (R2R) environments show that the proposed method improves the performance of instruction following by leveraging unpaired data, and improves the performance of the speaker-follower model by 2\% to 4\% in R2R.
    Generalizable Black-Box Adversarial Attack with Meta Learning. (arXiv:2301.00364v1 [cs.LG])
    In the scenario of black-box adversarial attack, the target model's parameters are unknown, and the attacker aims to find a successful adversarial perturbation based on query feedback under a query budget. Due to the limited feedback information, existing query-based black-box attack methods often require many queries for attacking each benign example. To reduce query cost, we propose to utilize the feedback information across historical attacks, dubbed example-level adversarial transferability. Specifically, by treating the attack on each benign example as one task, we develop a meta-learning framework by training a meta-generator to produce perturbations conditioned on benign examples. When attacking a new benign example, the meta generator can be quickly fine-tuned based on the feedback information of the new task as well as a few historical attacks to produce effective perturbations. Moreover, since the meta-train procedure consumes many queries to learn a generalizable generator, we utilize model-level adversarial transferability to train the meta-generator on a white-box surrogate model, then transfer it to help the attack against the target model. The proposed framework with the two types of adversarial transferability can be naturally combined with any off-the-shelf query-based attack methods to boost their performance, which is verified by extensive experiments.
    Capturing positive network attributes during the estimation of recursive logit models: A prism-based approach. (arXiv:2204.01215v3 [econ.EM] UPDATED)
    Although the recursive logit (RL) model has been recently popular and has led to many applications and extensions, an important numerical issue with respect to the computation of value functions remains unsolved. This issue is particularly significant for model estimation, during which the parameters are updated every iteration and may violate the feasibility condition of the value function. To solve this numerical issue of the value function in the model estimation, this study performs an extensive analysis of a prism-constrained RL (Prism-RL) model proposed by Oyama and Hato (2019), which has a path set constrained by the prism defined based upon a state-extended network representation. The numerical experiments have shown two important properties of the Prism-RL model for parameter estimation. First, the prism-based approach enables estimation regardless of the initial and true parameter values, even in cases where the original RL model cannot be estimated due to the numerical problem. We also successfully captured a positive effect of the presence of street green on pedestrian route choice in a real application. Second, the Prism-RL model achieved better fit and prediction performance than the RL model, by implicitly restricting paths with large detour or many loops. Defining the prism-based path set in a data-oriented manner, we demonstrated the possibility of the Prism-RL model describing more realistic route choice behavior. The capture of positive network attributes while retaining the diversity of path alternatives is important in many applications such as pedestrian route choice and sequential destination choice behavior, and thus the prism-based approach significantly extends the practical applicability of the RL model.  ( 2 min )
    Stochastic Variable Metric Proximal Gradient with variance reduction for non-convex composite optimization. (arXiv:2301.00631v1 [cs.LG])
    This paper introduces a novel algorithm, the Perturbed Proximal Preconditioned SPIDER algorithm (3P-SPIDER), designed to solve finite sum non-convex composite optimization. It is a stochastic Variable Metric Forward-Backward algorithm, which allows approximate preconditioned forward operator and uses a variable metric proximity operator as the backward operator; it also proposes a mini-batch strategy with variance reduction to address the finite sum setting. We show that 3P-SPIDER extends some Stochastic preconditioned Gradient Descent-based algorithms and some Incremental Expectation Maximization algorithms to composite optimization and to the case the forward operator can not be computed in closed form. We also provide an explicit control of convergence in expectation of 3P-SPIDER, and study its complexity in order to satisfy the epsilon-approximate stationary condition. Our results are the first to combine the composite non-convex optimization setting, a variance reduction technique to tackle the finite sum setting by using a minibatch strategy and, to allow deterministic or random approximations of the preconditioned forward operator. Finally, through an application to inference in a logistic regression model with random effects, we numerically compare 3P-SPIDER to other stochastic forward-backward algorithms and discuss the role of some design parameters of 3P-SPIDER.
    Smooth Mathematical Function from Compact Neural Networks. (arXiv:2301.00181v1 [cs.NE])
    This is paper for the smooth function approximation by neural networks (NN). Mathematical or physical functions can be replaced by NN models through regression. In this study, we get NNs that generate highly accurate and highly smooth function, which only comprised of a few weight parameters, through discussing a few topics about regression. First, we reinterpret inside of NNs for regression; consequently, we propose a new activation function--integrated sigmoid linear unit (ISLU). Then special charateristics of metadata for regression, which is different from other data like image or sound, is discussed for improving the performance of neural networks. Finally, the one of a simple hierarchical NN that generate models substituting mathematical function is presented, and the new batch concept ``meta-batch" which improves the performance of NN several times more is introduced. The new activation function, meta-batch method, features of numerical data, meta-augmentation with metaparameters, and a structure of NN generating a compact multi-layer perceptron(MLP) are essential in this study.
    Point Cloud-based Proactive Link Quality Prediction for Millimeter-wave Communications. (arXiv:2301.00752v1 [cs.NI])
    This study demonstrates the feasibility of point cloud-based proactive link quality prediction for millimeter-wave (mmWave) communications. Image-based methods to quantitatively and deterministically predict future received signal strength using machine learning from time series of depth images to mitigate the human body line-of-sight (LOS) path blockage in mmWave communications have been proposed. However, image-based methods have been limited in applicable environments because camera images may contain private information. Thus, this study demonstrates the feasibility of using point clouds obtained from light detection and ranging (LiDAR) for the mmWave link quality prediction. Point clouds represent three-dimensional (3D) spaces as a set of points and are sparser and less likely to contain sensitive information than camera images. Additionally, point clouds provide 3D position and motion information, which is necessary for understanding the radio propagation environment involving pedestrians. This study designs the mmWave link quality prediction method and conducts two experimental evaluations using different types of point clouds obtained from LiDAR and depth cameras, as well as different numerical indicators of link quality, received signal strength and throughput. Based on these experiments, our proposed method can predict future large attenuation of mmWave link quality due to LOS blockage by human bodies, therefore our point cloud-based method can be an alternative to image-based methods.
    Knockoffs-SPR: Clean Sample Selection in Learning with Noisy Labels. (arXiv:2301.00545v1 [cs.LG])
    A noisy training set usually leads to the degradation of the generalization and robustness of neural networks. In this paper, we propose a novel theoretically guaranteed clean sample selection framework for learning with noisy labels. Specifically, we first present a Scalable Penalized Regression (SPR) method, to model the linear relation between network features and one-hot labels. In SPR, the clean data are identified by the zero mean-shift parameters solved in the regression model. We theoretically show that SPR can recover clean data under some conditions. Under general scenarios, the conditions may be no longer satisfied; and some noisy data are falsely selected as clean data. To solve this problem, we propose a data-adaptive method for Scalable Penalized Regression with Knockoff filters (Knockoffs-SPR), which is provable to control the False-Selection-Rate (FSR) in the selected clean data. To improve the efficiency, we further present a split algorithm that divides the whole training set into small pieces that can be solved in parallel to make the framework scalable to large datasets. While Knockoffs-SPR can be regarded as a sample selection module for a standard supervised training pipeline, we further combine it with a semi-supervised algorithm to exploit the support of noisy data as unlabeled data. Experimental results on several benchmark datasets and real-world noisy datasets show the effectiveness of our framework and validate the theoretical results of Knockoffs-SPR. Our code and pre-trained models will be released.  ( 2 min )
    Local Differential Privacy for Sequential Decision Making in a Changing Environment. (arXiv:2301.00561v1 [cs.LG])
    We study the problem of preserving privacy while still providing high utility in sequential decision making scenarios in a changing environment. We consider abruptly changing environment: the environment remains constant during periods and it changes at unknown time instants. To formulate this problem, we propose a variant of multi-armed bandits called non-stationary stochastic corrupt bandits. We construct an algorithm called SW-KLUCB-CF and prove an upper bound on its utility using the performance measure of regret. The proven regret upper bound for SW-KLUCB-CF is near-optimal in the number of time steps and matches the best known bound for analogous problems in terms of the number of time steps and the number of changes. Moreover, we present a provably optimal mechanism which can guarantee the desired level of local differential privacy while providing high utility.  ( 2 min )
    In Quest of Ground Truth: Learning Confident Models and Estimating Uncertainty in the Presence of Annotator Noise. (arXiv:2301.00524v1 [cs.CV])
    The performance of the Deep Learning (DL) models depends on the quality of labels. In some areas, the involvement of human annotators may lead to noise in the data. When these corrupted labels are blindly regarded as the ground truth (GT), DL models suffer from performance deficiency. This paper presents a method that aims to learn a confident model in the presence of noisy labels. This is done in conjunction with estimating the uncertainty of multiple annotators. We robustly estimate the predictions given only the noisy labels by adding entropy or information-based regularizer to the classifier network. We conduct our experiments on a noisy version of MNIST, CIFAR-10, and FMNIST datasets. Our empirical results demonstrate the robustness of our method as it outperforms or performs comparably to other state-of-the-art (SOTA) methods. In addition, we evaluated the proposed method on the curated dataset, where the noise type and level of various annotators depend on the input image style. We show that our approach performs well and is adept at learning annotators' confusion. Moreover, we demonstrate how our model is more confident in predicting GT than other baselines. Finally, we assess our approach for segmentation problem and showcase its effectiveness with experiments.  ( 2 min )
    A contrastive learning approach for individual re-identification in a wild fish population. (arXiv:2301.00596v1 [cs.CV])
    In both terrestrial and marine ecology, physical tagging is a frequently used method to study population dynamics and behavior. However, such tagging techniques are increasingly being replaced by individual re-identification using image analysis. This paper introduces a contrastive learning-based model for identifying individuals. The model uses the first parts of the Inception v3 network, supported by a projection head, and we use contrastive learning to find similar or dissimilar image pairs from a collection of uniform photographs. We apply this technique for corkwing wrasse, Symphodus melops, an ecologically and commercially important fish species. Photos are taken during repeated catches of the same individuals from a wild population, where the intervals between individual sightings might range from a few days to several years. Our model achieves a one-shot accuracy of 0.35, a 5-shot accuracy of 0.56, and a 100-shot accuracy of 0.88, on our dataset.  ( 2 min )
    Data-Driven Optimization of Directed Information over Discrete Alphabets. (arXiv:2301.00621v1 [cs.IT])
    Directed information (DI) is a fundamental measure for the study and analysis of sequential stochastic models. In particular, when optimized over input distributions it characterizes the capacity of general communication channels. However, analytic computation of DI is typically intractable and existing optimization techniques over discrete input alphabets require knowledge of the channel model, which renders them inapplicable when only samples are available. To overcome these limitations, we propose a novel estimation-optimization framework for DI over discrete input spaces. We formulate DI optimization as a Markov decision process and leverage reinforcement learning techniques to optimize a deep generative model of the input process probability mass function (PMF). Combining this optimizer with the recently developed DI neural estimator, we obtain an end-to-end estimation-optimization algorithm which is applied to estimating the (feedforward and feedback) capacity of various discrete channels with memory. Furthermore, we demonstrate how to use the optimized PMF model to (i) obtain theoretical bounds on the feedback capacity of unifilar finite-state channels; and (ii) perform probabilistic shaping of constellations in the peak power-constrained additive white Gaussian noise channel.  ( 2 min )
    Posterior Collapse and Latent Variable Non-identifiability. (arXiv:2301.00537v1 [stat.ML])
    Variational autoencoders model high-dimensional data by positing low-dimensional latent variables that are mapped through a flexible distribution parametrized by a neural network. Unfortunately, variational autoencoders often suffer from posterior collapse: the posterior of the latent variables is equal to its prior, rendering the variational autoencoder useless as a means to produce meaningful representations. Existing approaches to posterior collapse often attribute it to the use of neural networks or optimization issues due to variational approximation. In this paper, we consider posterior collapse as a problem of latent variable non-identifiability. We prove that the posterior collapses if and only if the latent variables are non-identifiable in the generative model. This fact implies that posterior collapse is not a phenomenon specific to the use of flexible distributions or approximate inference. Rather, it can occur in classical probabilistic models even with exact inference, which we also demonstrate. Based on these results, we propose a class of latent-identifiable variational autoencoders, deep generative models which enforce identifiability without sacrificing flexibility. This model class resolves the problem of latent variable non-identifiability by leveraging bijective Brenier maps and parameterizing them with input convex neural networks, without special variational inference objectives or optimization tricks. Across synthetic and real datasets, latent-identifiable variational autoencoders outperform existing methods in mitigating posterior collapse and providing meaningful representations of the data.  ( 2 min )
    Learning to Maximize Mutual Information for Dynamic Feature Selection. (arXiv:2301.00557v1 [cs.LG])
    Feature selection helps reduce data acquisition costs in ML, but the standard approach is to train models with static feature subsets. Here, we consider the dynamic feature selection (DFS) problem where a model sequentially queries features based on the presently available information. DFS is often addressed with reinforcement learning (RL), but we explore a simpler approach of greedily selecting features based on their conditional mutual information. This method is theoretically appealing but requires oracle access to the data distribution, so we develop a learning approach based on amortized optimization. The proposed method is shown to recover the greedy policy when trained to optimality and outperforms numerous existing feature selection methods in our experiments, thus validating it as a simple but powerful approach for this problem.  ( 2 min )
    Dynamically Modular and Sparse General Continual Learning. (arXiv:2301.00620v1 [cs.CV])
    Real-world applications often require learning continuously from a stream of data under ever-changing conditions. When trying to learn from such non-stationary data, deep neural networks (DNNs) undergo catastrophic forgetting of previously learned information. Among the common approaches to avoid catastrophic forgetting, rehearsal-based methods have proven effective. However, they are still prone to forgetting due to task-interference as all parameters respond to all tasks. To counter this, we take inspiration from sparse coding in the brain and introduce dynamic modularity and sparsity (Dynamos) for rehearsal-based general continual learning. In this setup, the DNN learns to respond to stimuli by activating relevant subsets of neurons. We demonstrate the effectiveness of Dynamos on multiple datasets under challenging continual learning evaluation protocols. Finally, we show that our method learns representations that are modular and specialized, while maintaining reusability by activating subsets of neurons with overlaps corresponding to the similarity of stimuli.  ( 2 min )
    Model-Driven Deep Learning for Non-Coherent Massive Machine-Type Communications. (arXiv:2301.00516v1 [cs.IT])
    In this paper, we investigate the joint device activity and data detection in massive machine-type communications (mMTC) with a one-phase non-coherent scheme, where data bits are embedded in the pilot sequences and the base station simultaneously detects active devices and their embedded data bits without explicit channel estimation. Due to the correlated sparsity pattern introduced by the non-coherent transmission scheme, the traditional approximate message passing (AMP) algorithm cannot achieve satisfactory performance. Therefore, we propose a deep learning (DL) modified AMP network (DL-mAMPnet) that enhances the detection performance by effectively exploiting the pilot activity correlation. The DL-mAMPnet is constructed by unfolding the AMP algorithm into a feedforward neural network, which combines the principled mathematical model of the AMP algorithm with the powerful learning capability, thereby benefiting from the advantages of both techniques. Trainable parameters are introduced in the DL-mAMPnet to approximate the correlated sparsity pattern and the large-scale fading coefficient. Moreover, a refinement module is designed to further advance the performance by utilizing the spatial feature caused by the correlated sparsity pattern. Simulation results demonstrate that the proposed DL-mAMPnet can significantly outperform traditional algorithms in terms of the symbol error rate performance.  ( 2 min )
    EmoGator: A New Open Source Vocal Burst Dataset with Baseline Machine Learning Classification Methodologies. (arXiv:2301.00508v1 [cs.SD])
    Vocal Bursts -- short, non-speech vocalizations that convey emotions, such as laughter, cries, sighs, moans, and groans -- are an often-overlooked aspect of speech emotion recognition, but an important aspect of human vocal communication. One barrier to study of these interesting vocalizations is a lack of large datasets. I am pleased to introduce the EmoGator dataset, which consists of 32,040 samples from 365 speakers, 16.91 hours of audio; each sample classified into one of 30 distinct emotion categories by the speaker. Several different approaches to construct classifiers to identify emotion categories will be discussed, and directions for future research will be suggested. Data set is available for download from https://github.com/fredbuhl/EmoGator.  ( 2 min )
    Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting. (arXiv:2301.00493v1 [cs.CV])
    We introduce Argoverse 2 (AV2) - a collection of three datasets for perception and forecasting research in the self-driving domain. The annotated Sensor Dataset contains 1,000 sequences of multimodal data, encompassing high-resolution imagery from seven ring cameras, and two stereo cameras in addition to lidar point clouds, and 6-DOF map-aligned pose. Sequences contain 3D cuboid annotations for 26 object categories, all of which are sufficiently-sampled to support training and evaluation of 3D perception models. The Lidar Dataset contains 20,000 sequences of unlabeled lidar point clouds and map-aligned pose. This dataset is the largest ever collection of lidar sensor data and supports self-supervised learning and the emerging task of point cloud forecasting. Finally, the Motion Forecasting Dataset contains 250,000 scenarios mined for interesting and challenging interactions between the autonomous vehicle and other actors in each local scene. Models are tasked with the prediction of future motion for "scored actors" in each scenario and are provided with track histories that capture object location, heading, velocity, and category. In all three datasets, each scenario contains its own HD Map with 3D lane and crosswalk geometry - sourced from data captured in six distinct cities. We believe these datasets will support new and existing machine learning research problems in ways that existing datasets do not. All datasets are released under the CC BY-NC-SA 4.0 license.  ( 2 min )
    Deep Correlation-Aware Kernelized Autoencoders for Anomaly Detection in Cybersecurity. (arXiv:2301.00462v1 [cs.LG])
    Unsupervised learning-based anomaly detection in latent space has gained importance since discriminating anomalies from normal data becomes difficult in high-dimensional space. Both density estimation and distance-based methods to detect anomalies in latent space have been explored in the past. These methods prove that retaining valuable properties of input data in latent space helps in the better reconstruction of test data. Moreover, real-world sensor data is skewed and non-Gaussian in nature, making mean-based estimators unreliable for skewed data. Again, anomaly detection methods based on reconstruction error rely on Euclidean distance, which does not consider useful correlation information in the feature space and also fails to accurately reconstruct the data when it deviates from the training distribution. In this work, we address the limitations of reconstruction error-based autoencoders and propose a kernelized autoencoder that leverages a robust form of Mahalanobis distance (MD) to measure latent dimension correlation to effectively detect both near and far anomalies. This hybrid loss is aided by the principle of maximizing the mutual information gain between the latent dimension and the high-dimensional prior data space by maximizing the entropy of the latent space while preserving useful correlation information of the original data in the low-dimensional latent space. The multi-objective function has two goals -- it measures correlation information in the latent feature space in the form of robust MD distance and simultaneously tries to preserve useful correlation information from the original data space in the latent space by maximizing mutual information between the prior and latent space.  ( 2 min )
    Decision Models for Selecting Federated Learning Architecture Patterns. (arXiv:2204.13291v2 [cs.LG] UPDATED)
    Federated learning is growing fast in academia and industries as a solution to solve data hungriness and privacy issues in machine learning. Being a widely distributed system, federated learning requires various system design thinking. To better design a federated learning system, researchers have introduced multiple patterns and tactics that cover various system design aspects. However, the multitude of patterns leaves the designers confused about when and which pattern to adopt. In this paper, we present a set of decision models for the selection of patterns for federated learning architecture design based on a systematic literature review on federated learning, to assist designers and architects who have limited knowledge of federated learning. Each decision model maps functional and non-functional requirements of federated learning systems to a set of patterns. We also clarify the trade-offs in the patterns. We evaluated the decision models by mapping the decision patterns to concrete federated learning architectures by big tech firms to assess the models' correctness and usefulness. The evaluation results indicate that the proposed decision models are able to bring structure to the federated learning architecture design process and help explicitly articulate the design rationale.  ( 2 min )
    Dimensionless machine learning: Imposing exact units equivariance. (arXiv:2204.00887v2 [stat.ML] UPDATED)
    Units equivariance (or units covariance) is the exact symmetry that follows from the requirement that relationships among measured quantities of physics relevance must obey self-consistent dimensional scalings. Here, we express this symmetry in terms of a (non-compact) group action, and we employ dimensional analysis and ideas from equivariant machine learning to provide a methodology for exactly units-equivariant machine learning: For any given learning task, we first construct a dimensionless version of its inputs using classic results from dimensional analysis, and then perform inference in the dimensionless space. Our approach can be used to impose units equivariance across a broad range of machine learning methods which are equivariant to rotations and other groups. We discuss the in-sample and out-of-sample prediction accuracy gains one can obtain in contexts like symbolic regression and emulation, where symmetry is important. We illustrate our approach with simple numerical examples involving dynamical systems in physics and ecology.  ( 2 min )
    Depthwise Convolution for Multi-Agent Communication with Enhanced Mean-Field Approximation. (arXiv:2203.02896v2 [cs.LG] UPDATED)
    Multi-agent settings remain a fundamental challenge in the reinforcement learning (RL) domain due to the partial observability and the lack of accurate real-time interactions across agents. In this paper, we propose a new method based on local communication learning to tackle the multi-agent RL (MARL) challenge within a large number of agents coexisting. First, we design a new communication protocol that exploits the ability of depthwise convolution to efficiently extract local relations and learn local communication between neighboring agents. To facilitate multi-agent coordination, we explicitly learn the effect of joint actions by taking the policies of neighboring agents as inputs. Second, we introduce the mean-field approximation into our method to reduce the scale of agent interactions. To more effectively coordinate behaviors of neighboring agents, we enhance the mean-field approximation by a supervised policy rectification network (PRN) for rectifying real-time agent interactions and by a learnable compensation term for correcting the approximation bias. The proposed method enables efficient coordination as well as outperforms several baseline approaches on the adaptive traffic signal control (ATSC) task and the StarCraft II multi-agent challenge (SMAC).  ( 2 min )
    Secure Distributed Training at Scale. (arXiv:2106.11257v4 [cs.LG] UPDATED)
    Many areas of deep learning benefit from using increasingly larger neural networks trained on public data, as is the case for pre-trained models for NLP and computer vision. Training such models requires a lot of computational resources (e.g., HPC clusters) that are not available to small research groups and independent researchers. One way to address it is for several smaller groups to pool their computational resources together and train a model that benefits all participants. Unfortunately, in this case, any participant can jeopardize the entire training run by sending incorrect updates, deliberately or by mistake. Training in presence of such peers requires specialized distributed training algorithms with Byzantine tolerance. These algorithms often sacrifice efficiency by introducing redundant communication or passing all updates through a trusted server, making it infeasible to apply them to large-scale deep learning, where models can have billions of parameters. In this work, we propose a novel protocol for secure (Byzantine-tolerant) decentralized training that emphasizes communication efficiency.  ( 2 min )
    Distribution Embedding Networks for Generalization from a Diverse Set of Classification Tasks. (arXiv:2202.01940v2 [stat.ML] UPDATED)
    We propose Distribution Embedding Networks (DEN) for classification with small data. In the same spirit of meta-learning, DEN learns from a diverse set of training tasks with the goal to generalize to unseen target tasks. Unlike existing approaches which require the inputs of training and target tasks to have the same dimension with possibly similar distributions, DEN allows training and target tasks to live in heterogeneous input spaces. This is especially useful for tabular-data tasks where labeled data from related tasks are scarce. DEN uses a three-block architecture: a covariate transformation block followed by a distribution embedding block and then a classification block. We provide theoretical insights to show that this architecture allows the embedding and classification blocks to be fixed after pre-training on a diverse set of tasks; only the covariate transformation block with relatively few parameters needs to be fine-tuned for each new task. To facilitate training, we also propose an approach to synthesize binary classification tasks, and demonstrate that DEN outperforms existing methods in a number of synthetic and real tasks in numerical studies.  ( 2 min )
    Neural System Level Synthesis: Learning over All Stabilizing Policies for Nonlinear Systems. (arXiv:2203.11812v2 [eess.SY] UPDATED)
    We address the problem of designing stabilizing control policies for nonlinear systems in discrete-time, while minimizing an arbitrary cost function. When the system is linear and the cost is convex, the System Level Synthesis (SLS) approach offers an effective solution based on convex programming. Beyond this case, a globally optimal solution cannot be found in a tractable way, in general. In this paper, we develop a parametrization of all and only the control policies stabilizing a given time-varying nonlinear system in terms of the combined effect of 1) a strongly stabilizing base controller and 2) a stable SLS operator to be freely designed. Based on this result, we propose a Neural SLS (Neur-SLS) approach guaranteeing closed-loop stability during and after parameter optimization, without requiring any constraints to be satisfied. We exploit recent Deep Neural Network (DNN) models based on Recurrent Equilibrium Networks (RENs) to learn over a rich class of nonlinear stable operators, and demonstrate the effectiveness of the proposed approach in numerical examples.  ( 2 min )
    Low-Rank Updates of Matrix Square Roots. (arXiv:2201.13156v2 [math.NA] UPDATED)
    Models in which the covariance matrix has the structure of a sparse matrix plus a low rank perturbation are ubiquitous in machine learning applications. It is often desirable for learning algorithms to take advantage of such structures, avoiding costly matrix computations that often require cubic time and quadratic storage. This is often accomplished by performing operations that maintain such structures, e.g. matrix inversion via the Sherman-Morrison-Woodbury formula. In this paper we consider the matrix square root and inverse square root operations. Given a low rank perturbation to a matrix, we argue that a low-rank approximate correction to the (inverse) square root exists. We do so by establishing a geometric decay bound on the true correction's eigenvalues. We then proceed to frame the correction has the solution of an algebraic Ricatti equation, and discuss how a low-rank solution to that equation can be computed. We analyze the approximation error incurred when approximately solving the algebraic Ricatti equation, providing spectral and Frobenius norm forward and backward error bounds. Finally, we describe several applications of our algorithms, and demonstrate their utility in numerical experiments.  ( 2 min )
    On the Importance of Regularisation & Auxiliary Information in OOD Detection. (arXiv:2107.07564v2 [cs.LG] UPDATED)
    Neural networks are often utilised in critical domain applications (e.g. self-driving cars, financial markets, and aerospace engineering), even though they exhibit overconfident predictions for ambiguous inputs. This deficiency demonstrates a fundamental flaw indicating that neural networks often overfit on spurious correlations. To address this problem in this work we present two novel objectives that improve the ability of a network to detect out-of-distribution samples and therefore avoid overconfident predictions for ambiguous inputs. We empirically demonstrate that our methods outperform the baseline and perform better than the majority of existing approaches while still maintaining a competitive performance against the rest. Additionally, we empirically demonstrate the robustness of our approach against common corruptions and demonstrate the importance of regularisation and auxiliary information in out-of-distribution detection.  ( 2 min )
    CORA: Benchmarks, Baselines, and Metrics as a Platform for Continual Reinforcement Learning Agents. (arXiv:2110.10067v2 [cs.LG] UPDATED)
    Progress in continual reinforcement learning has been limited due to several barriers to entry: missing code, high compute requirements, and a lack of suitable benchmarks. In this work, we present CORA, a platform for Continual Reinforcement Learning Agents that provides benchmarks, baselines, and metrics in a single code package. The benchmarks we provide are designed to evaluate different aspects of the continual RL challenge, such as catastrophic forgetting, plasticity, ability to generalize, and sample-efficient learning. Three of the benchmarks utilize video game environments (Atari, Procgen, NetHack). The fourth benchmark, CHORES, consists of four different task sequences in a visually realistic home simulator, drawn from a diverse set of task and scene parameters. To compare continual RL methods on these benchmarks, we prepare three metrics in CORA: Continual Evaluation, Isolated Forgetting, and Zero-Shot Forward Transfer. Finally, CORA includes a set of performant, open-source baselines of existing algorithms for researchers to use and expand on. We release CORA and hope that the continual RL community can benefit from our contributions, to accelerate the development of new continual RL algorithms.  ( 2 min )
    On minimizers and convolutional filters: a partial justification for the effectiveness of CNNs in categorical sequence analysis. (arXiv:2111.08452v4 [cs.LG] UPDATED)
    Minimizers and convolutional neural networks (CNNs) are two quite distinct popular techniques that have both been employed to analyze categorical biological sequences. At face value, the methods seem entirely dissimilar. Minimizers use min-wise hashing on a rolling window to extract a single important k-mer feature per window. CNNs start with a wide array of randomly initialized convolutional filters, paired with a pooling operation, and then multiple additional neural layers to learn both the filters themselves and how those filters can be used to classify the sequence. In this manuscript, we demonstrate through a careful mathematical analysis of hash function properties that for sequences over a categorical alphabet, random Gaussian initialization of convolutional filters with max-pooling is equivalent to choosing a minimizer ordering such that selected k-mers are (in Hamming distance) far from the k-mers within the sequence but close to other minimizers. In empirical experiments, we find that this property manifests as decreased density in repetitive regions, both in simulation and on real human telomeres. This provides a partial explanation for the effectiveness of CNNs in categorical sequence analysis.  ( 2 min )
    InfoFair: Information-Theoretic Intersectional Fairness. (arXiv:2105.11069v2 [cs.LG] UPDATED)
    Algorithmic fairness is becoming increasingly important in data mining and machine learning. Among others, a foundational notation is group fairness. The vast majority of the existing works on group fairness, with a few exceptions, primarily focus on debiasing with respect to a single sensitive attribute, despite the fact that the co-existence of multiple sensitive attributes (e.g., gender, race, marital status, etc.) in the real-world is commonplace. As such, methods that can ensure a fair learning outcome with respect to all sensitive attributes of concern simultaneously need to be developed. In this paper, we study the problem of information-theoretic intersectional fairness (InfoFair), where statistical parity, a representative group fairness measure, is guaranteed among demographic groups formed by multiple sensitive attributes of interest. We formulate it as a mutual information minimization problem and propose a generic end-to-end algorithmic framework to solve it. The key idea is to leverage a variational representation of mutual information, which considers the variational distribution between learning outcomes and sensitive attributes, as well as the density ratio between the variational and the original distributions. Our proposed framework is generalizable to many different settings, including other statistical notions of fairness, and could handle any type of learning task equipped with a gradient-based optimizer. Empirical evaluations in the fair classification task on three real-world datasets demonstrate that our proposed framework can effectively debias the classification results with minimal impact to the classification accuracy.  ( 2 min )
    The Fragility of Optimized Bandit Algorithms. (arXiv:2109.13595v5 [cs.LG] UPDATED)
    Much of the literature on optimal design of bandit algorithms is based on minimization of expected regret. It is well known that designs that are optimal over certain exponential families can achieve expected regret that grows logarithmically in the number of arm plays, at a rate governed by the Lai-Robbins lower bound. In this paper, we show that when one uses such optimized designs, the regret distribution of the associated algorithms necessarily has a very heavy tail, specifically, that of a truncated Cauchy distribution. Furthermore, for $p>1$, the $p$'th moment of the regret distribution grows much faster than poly-logarithmically, in particular as a power of the total number of arm plays. We show that optimized UCB bandit designs are also fragile in an additional sense, namely when the problem is even slightly mis-specified, the regret can grow much faster than the conventional theory suggests. Our arguments are based on standard change-of-measure ideas, and indicate that the most likely way that regret becomes larger than expected is when the optimal arm returns below-average rewards in the first few arm plays, thereby causing the algorithm to believe that the arm is sub-optimal. To alleviate the fragility issues exposed, we show that UCB algorithms can be modified so as to ensure a desired degree of robustness to mis-specification. In doing so, we also provide a sharp trade-off between the amount of UCB exploration and the tail exponent of the resulting regret distribution.  ( 2 min )
    Explicit construction of the minimum error variance estimator for stochastic LTI state-space systems. (arXiv:2109.02384v3 [math.OC] UPDATED)
    In this short article, we showcase the derivation of the optimal (minimum error variance) estimator, when one part of the stochastic LTI system output is not measured but is able to be predicted from the measured system outputs. Similar derivations have been done before but not using state-space representation.  ( 2 min )
    HeLayers: A Tile Tensors Framework for Large Neural Networks on Encrypted Data. (arXiv:2011.01805v3 [cs.CR] UPDATED)
    Privacy-preserving solutions enable companies to offload confidential data to third-party services while fulfilling their government regulations. To accomplish this, they leverage various cryptographic techniques such as Homomorphic Encryption (HE), which allows performing computation on encrypted data. Most HE schemes work in a SIMD fashion, and the data packing method can dramatically affect the running time and memory costs. Finding a packing method that leads to an optimal performant implementation is a hard task. We present a simple and intuitive framework that abstracts the packing decision for the user. We explain its underlying data structures and optimizer, and propose a novel algorithm for performing 2D convolution operations. We used this framework to implement an HE-friendly version of AlexNet, which runs in three minutes, several orders of magnitude faster than other state-of-the-art solutions that only use HE.  ( 2 min )
    Sinkhorn Distributionally Robust Optimization. (arXiv:2109.11926v2 [math.OC] UPDATED)
    We study distributionally robust optimization (DRO) with Sinkhorn distance -- a variant of Wasserstein distance based on entropic regularization. We provide convex programming dual reformulation for a general nominal distribution. Compared with Wasserstein DRO, it is computationally tractable for a larger class of loss functions, and its worst-case distribution is more reasonable. We propose an efficient first-order algorithm with bisection search to solve the dual reformulation. We demonstrate that our proposed algorithm finds $\delta$-optimal solution of the new DRO formulation with computation cost $\tilde{O}(\delta^{-3})$ and memory cost $\tilde{O}(\delta^{-2})$, and the computation cost further improves to $\tilde{O}(\delta^{-2})$ when the loss function is smooth. Finally, we provide various numerical examples using both synthetic and real data to demonstrate its competitive performance and light computational speed.  ( 2 min )
    Friedrichs Learning: Weak Solutions of Partial Differential Equations via Deep Learning. (arXiv:2012.08023v3 [math.NA] UPDATED)
    This paper proposes Friedrichs learning as a novel deep learning methodology that can learn the weak solutions of PDEs via a minmax formulation, which transforms the PDE problem into a minimax optimization problem to identify weak solutions. The name "Friedrichs learning" is for highlighting the close relationship between our learning strategy and Friedrichs theory on symmetric systems of PDEs. The weak solution and the test function in the weak formulation are parameterized as deep neural networks in a mesh-free manner, which are alternately updated to approach the optimal solution networks approximating the weak solution and the optimal test function, respectively. Extensive numerical results indicate that our mesh-free method can provide reasonably good solutions to a wide range of PDEs defined on regular and irregular domains in various dimensions, where classical numerical methods such as finite difference methods and finite element methods may be tedious or difficult to be applied.  ( 2 min )
    Fusing Models for Prognostics and Health Management of Lithium-Ion Batteries Based on Physics-Informed Neural Networks. (arXiv:2301.00776v1 [eess.SP])
    For Prognostics and Health Management (PHM) of Lithium-ion (Li-ion) batteries, many models have been established to characterize their degradation process. The existing empirical or physical models can reveal important information regarding the degradation dynamics. However, there is no general and flexible methods to fuse the information represented by those models. Physics-Informed Neural Network (PINN) is an efficient tool to fuse empirical or physical dynamic models with data-driven models. To take full advantage of various information sources, we propose a model fusion scheme based on PINN. It is implemented by developing a semi-empirical semi-physical Partial Differential Equation (PDE) to model the degradation dynamics of Li-ion-batteries. When there is little prior knowledge about the dynamics, we leverage the data-driven Deep Hidden Physics Model (DeepHPM) to discover the underlying governing dynamic models. The uncovered dynamics information is then fused with that mined by the surrogate neural network in the PINN framework. Moreover, an uncertainty-based adaptive weighting method is employed to balance the multiple learning tasks when training the PINN. The proposed methods are verified on a public dataset of Li-ion Phosphate (LFP)/graphite batteries.  ( 2 min )
    Causal Inference (C-inf) -- closed form worst case typical phase transitions. (arXiv:2301.00793v1 [stat.ML])
    In this paper we establish a mathematically rigorous connection between Causal inference (C-inf) and the low-rank recovery (LRR). Using Random Duality Theory (RDT) concepts developed in [46,48,50] and novel mathematical strategies related to free probability theory, we obtain the exact explicit typical (and achievable) worst case phase transitions (PT). These PT precisely separate scenarios where causal inference via LRR is possible from those where it is not. We supplement our mathematical analysis with numerical experiments that confirm the theoretical predictions of PT phenomena, and further show that the two closely match for fairly small sample sizes. We obtain simple closed form representations for the resulting PTs, which highlight direct relations between the low rankness of the target C-inf matrix and the time of the treatment. Hence, our results can be used to determine the range of C-inf's typical applicability.  ( 2 min )
    Projection Robust Wasserstein Distance and Riemannian Optimization. (arXiv:2006.07458v10 [cs.LG] UPDATED)
    Projection robust Wasserstein (PRW) distance, or Wasserstein projection pursuit (WPP), is a robust variant of the Wasserstein distance. Recent work suggests that this quantity is more robust than the standard Wasserstein distance, in particular when comparing probability measures in high-dimensions. However, it is ruled out for practical application because the optimization model is essentially non-convex and non-smooth which makes the computation intractable. Our contribution in this paper is to revisit the original motivation behind WPP/PRW, but take the hard route of showing that, despite its non-convexity and lack of nonsmoothness, and even despite some hardness results proved by~\citet{Niles-2019-Estimation} in a minimax sense, the original formulation for PRW/WPP \textit{can} be efficiently computed in practice using Riemannian optimization, yielding in relevant cases better behavior than its convex relaxation. More specifically, we provide three simple algorithms with solid theoretical guarantee on their complexity bound (one in the appendix), and demonstrate their effectiveness and efficiency by conducing extensive experiments on synthetic and real data. This paper provides a first step into a computational theory of the PRW distance and provides the links between optimal transport and Riemannian optimization.  ( 3 min )
    Robust machine learning pipelines for trading market-neutral stock portfolios. (arXiv:2301.00790v1 [q-fin.CP])
    The application of deep learning algorithms to financial data is difficult due to heavy non-stationarities which can lead to over-fitted models that underperform under regime changes. Using the Numerai tournament data set as a motivating example, we propose a machine learning pipeline for trading market-neutral stock portfolios based on tabular data which is robust under changes in market conditions. We evaluate various machine-learning models, including Gradient Boosting Decision Trees (GBDTs) and Neural Networks with and without simple feature engineering, as the building blocks for the pipeline. We find that GBDT models with dropout display high performance, robustness and generalisability with relatively low complexity and reduced computational cost. We then show that online learning techniques can be used in post-prediction processing to enhance the results. In particular, dynamic feature neutralisation, an efficient procedure that requires no retraining of models and can be applied post-prediction to any machine learning model, improves robustness by reducing drawdown in volatile market conditions. Furthermore, we demonstrate that the creation of model ensembles through dynamic model selection based on recent model performance leads to improved performance over baseline by improving the Sharpe and Calmar ratios. We also evaluate the robustness of our pipeline across different data splits and random seeds with good reproducibility of results.  ( 2 min )
    Causal Inference (C-inf) -- asymmetric scenario of typical phase transitions. (arXiv:2301.00801v1 [stat.ML])
    In this paper, we revisit and further explore a mathematically rigorous connection between Causal inference (C-inf) and the Low-rank recovery (LRR) established in [10]. Leveraging the Random duality - Free probability theory (RDT-FPT) connection, we obtain the exact explicit typical C-inf asymmetric phase transitions (PT). We uncover a doubling low-rankness phenomenon, which means that exactly two times larger low rankness is allowed in asymmetric scenarios compared to the symmetric worst case ones considered in [10]. Consequently, the final PT mathematical expressions are as elegant as those obtained in [10], and highlight direct relations between the targeted C-inf matrix low rankness and the time of treatment. Our results have strong implications for applications, where C-inf matrices are not necessarily symmetric.  ( 2 min )
    G-CEALS: Gaussian Cluster Embedding in Autoencoder Latent Space for Tabular Data Representation. (arXiv:2301.00802v1 [cs.LG])
    The latent space of autoencoders has been improved for clustering image data by jointly learning a t-distributed embedding with a clustering algorithm inspired by the neighborhood embedding concept proposed for data visualization. However, multivariate tabular data pose different challenges in representation learning than image data, where traditional machine learning is often superior to deep tabular data learning. In this paper, we address the challenges of learning tabular data in contrast to image data and present a novel Gaussian Cluster Embedding in Autoencoder Latent Space (G-CEALS) algorithm by replacing t-distributions with multivariate Gaussian clusters. Unlike current methods, the proposed approach independently defines the Gaussian embedding and the target cluster distribution to accommodate any clustering algorithm in representation learning. A trained G-CEALS model extracts a quality embedding for unseen test data. Based on the embedding clustering accuracy, the average rank of the proposed G-CEALS method is 1.4 (0.7), which is superior to all eight baseline clustering and cluster embedding methods on seven tabular data sets. This paper shows one of the first algorithms to jointly learn embedding and clustering to improve multivariate tabular data representation in downstream clustering.  ( 2 min )
    On Bilevel Optimization without Lower-level Strong Convexity. (arXiv:2301.00712v1 [math.OC])
    Theoretical properties of bilevel problems are well studied when the lower-level problem is strongly convex. In this work, we focus on bilevel optimization problems without the strong-convexity assumption. In these cases, we first show that the common local optimality measures such as KKT condition or regularization can lead to undesired consequences. Then, we aim to identify the mildest conditions that make bilevel problems tractable. We identify two classes of growth conditions on the lower-level objective that leads to continuity. Under these assumptions, we show that the local optimality of the bilevel problem can be defined via the Goldstein stationarity condition of the hyper-objective. We then propose the Inexact Gradient-Free Method (IGFM) to solve the bilevel problem, using an approximate zeroth order oracle that is of independent interest. Our non-asymptotic analysis demonstrates that the proposed method can find a $(\delta, \varepsilon)$ Goldstein stationary point for bilevel problems with a zeroth order oracle complexity that is polynomial in $d, 1/\delta$ and $1/\varepsilon$.  ( 2 min )
    CLIP-Driven Universal Model for Organ Segmentation and Tumor Detection. (arXiv:2301.00785v1 [eess.IV])
    An increasing number of public datasets have shown a marked clinical impact on assessing anatomical structures. However, each of the datasets is small, partially labeled, and rarely investigates severe tumor subjects. Moreover, current models are limited to segmenting specific organs/tumors, which can not be extended to novel domains and classes. To tackle these limitations, we introduce embedding learned from Contrastive Language-Image Pre-training (CLIP) to segmentation models, dubbed the CLIP-Driven Universal Model. The Universal Model can better segment 25 organs and 6 types of tumors by exploiting the semantic relationship between abdominal structures. The model is developed from an assembly of 14 datasets with 3,410 CT scans and evaluated on 6,162 external CT scans from 3 datasets. We rank first on the public leaderboard of the Medical Segmentation Decathlon (MSD) and achieve the state-of-the-art results on Beyond The Cranial Vault (BTCV). Compared with dataset-specific models, the Universal Model is computationally more efficient (6x faster), generalizes better to CT scans from varying sites, and shows stronger transfer learning performance on novel tasks. The design of CLIP embedding enables the Universal Model to be easily extended to new classes without catastrophically forgetting the previously learned classes.  ( 2 min )
    Muse: Text-To-Image Generation via Masked Generative Transformers. (arXiv:2301.00704v1 [cs.CV])
    We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient due to the use of discrete tokens and requiring fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient due to the use of parallel decoding. The use of a pre-trained LLM enables fine-grained language understanding, translating to high-fidelity image generation and the understanding of visual concepts such as objects, their spatial relationships, pose, cardinality etc. Our 900M parameter model achieves a new SOTA on CC3M, with an FID score of 6.06. The Muse 3B parameter model achieves an FID of 7.88 on zero-shot COCO evaluation, along with a CLIP score of 0.32. Muse also directly enables a number of image editing applications without the need to fine-tune or invert the model: inpainting, outpainting, and mask-free editing. More results are available at https://muse-model.github.io  ( 2 min )
    SIRL: Similarity-based Implicit Representation Learning. (arXiv:2301.00810v1 [cs.RO])
    When robots learn reward functions using high capacity models that take raw state directly as input, they need to both learn a representation for what matters in the task -- the task ``features" -- as well as how to combine these features into a single objective. If they try to do both at once from input designed to teach the full reward function, it is easy to end up with a representation that contains spurious correlations in the data, which fails to generalize to new settings. Instead, our ultimate goal is to enable robots to identify and isolate the causal features that people actually care about and use when they represent states and behavior. Our idea is that we can tune into this representation by asking users what behaviors they consider similar: behaviors will be similar if the features that matter are similar, even if low-level behavior is different; conversely, behaviors will be different if even one of the features that matter differs. This, in turn, is what enables the robot to disambiguate between what needs to go into the representation versus what is spurious, as well as what aspects of behavior can be compressed together versus not. The notion of learning representations based on similarity has a nice parallel in contrastive learning, a self-supervised representation learning technique that maps visually similar data points to similar embeddings, where similarity is defined by a designer through data augmentation heuristics. By contrast, in order to learn the representations that people use, so we can learn their preferences and objectives, we use their definition of similarity. In simulation as well as in a user study, we show that learning through such similarity queries leads to representations that, while far from perfect, are indeed more generalizable than self-supervised and task-input alternatives.  ( 3 min )
    Human-in-the-Loop Hate Speech Classification in a Multilingual Context. (arXiv:2212.02108v2 [cs.CL] UPDATED)
    The shift of public debate to the digital sphere has been accompanied by a rise in online hate speech. While many promising approaches for hate speech classification have been proposed, studies often focus only on a single language, usually English, and do not address three key concerns: post-deployment performance, classifier maintenance and infrastructural limitations. In this paper, we introduce a new human-in-the-loop BERT-based hate speech classification pipeline and trace its development from initial data collection and annotation all the way to post-deployment. Our classifier, trained using data from our original corpus of over 422k examples, is specifically developed for the inherently multilingual setting of Switzerland and outperforms with its F1 score of 80.5 the currently best-performing BERT-based multilingual classifier by 5.8 F1 points in German and 3.6 F1 points in French. Our systematic evaluations over a 12-month period further highlight the vital importance of continuous, human-in-the-loop classifier maintenance to ensure robust hate speech classification post-deployment.  ( 2 min )
    Graph Construction from Data using Non Negative Kernel regression (NNK Graphs). (arXiv:1910.09383v2 [cs.LG] UPDATED)
    Data-driven neighborhood definitions and graph constructions are often used in machine learning and signal processing applications. k-nearest neighbor~(kNN) and $\epsilon$-neighborhood methods are among the most common methods used for neighborhood selection, due to their computational simplicity. However, the choice of parameters associated with these methods, such as k and $\epsilon$, is still ad hoc. We make two main contributions in this paper. First, we present an alternative view of neighborhood selection, where we show that neighborhood construction is equivalent to a sparse signal approximation problem. Second, we propose an algorithm, non-negative kernel regression~(NNK), for obtaining neighborhoods that lead to better sparse representation. NNK draws similarities to the orthogonal matching pursuit approach to signal representation and possesses desirable geometric and theoretical properties. Experiments demonstrate (i) the robustness of the NNK algorithm for neighborhood and graph construction, (ii) its ability to adapt the number of neighbors to the data properties, and (iii) its superior performance in local neighborhood and graph-based machine learning tasks.  ( 2 min )
    PCRLv2: A Unified Visual Information Preservation Framework for Self-supervised Pre-training in Medical Image Analysis. (arXiv:2301.00772v1 [cs.CV])
    Recent advances in self-supervised learning (SSL) in computer vision are primarily comparative, whose goal is to preserve invariant and discriminative semantics in latent representations by comparing siamese image views. However, the preserved high-level semantics do not contain enough local information, which is vital in medical image analysis (e.g., image-based diagnosis and tumor segmentation). To mitigate the locality problem of comparative SSL, we propose to incorporate the task of pixel restoration for explicitly encoding more pixel-level information into high-level semantics. We also address the preservation of scale information, a powerful tool in aiding image understanding but has not drawn much attention in SSL. The resulting framework can be formulated as a multi-task optimization problem on the feature pyramid. Specifically, we conduct multi-scale pixel restoration and siamese feature comparison in the pyramid. In addition, we propose non-skip U-Net to build the feature pyramid and develop sub-crop to replace multi-crop in 3D medical imaging. The proposed unified SSL framework (PCRLv2) surpasses its self-supervised counterparts on various tasks, including brain tumor segmentation (BraTS 2018), chest pathology identification (ChestX-ray, CheXpert), pulmonary nodule detection (LUNA), and abdominal organ segmentation (LiTS), sometimes outperforming them by large margins with limited annotations.  ( 2 min )
    Massive Language Models Can Be Accurately Pruned in One-Shot. (arXiv:2301.00774v1 [cs.LG])
    We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. When executing SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, we can reach 60% sparsity with negligible increase in perplexity: remarkably, more than 100 billion weights from these models can be ignored at inference time. SparseGPT generalizes to semi-structured (2:4 and 4:8) patterns, and is compatible with weight quantization approaches.  ( 2 min )
    A Survey on Federated Recommendation Systems. (arXiv:2301.00767v1 [cs.IR])
    Federated learning has recently been applied to recommendation systems to protect user privacy. In federated learning settings, recommendation systems can train recommendation models only collecting the intermediate parameters instead of the real user data, which greatly enhances the user privacy. Beside, federated recommendation systems enable to collaborate with other data platforms to improve recommended model performance while meeting the regulation and privacy constraints. However, federated recommendation systems faces many new challenges such as privacy, security, heterogeneity and communication costs. While significant research has been conducted in these areas, gaps in the surveying literature still exist. In this survey, we-(1) summarize some common privacy mechanisms used in federated recommendation systems and discuss the advantages and limitations of each mechanism; (2) review some robust aggregation strategies and several novel attacks against security; (3) summarize some approaches to address heterogeneity and communication costs problems; (4)introduce some open source platforms that can be used to build federated recommendation systems; (5) present some prospective research directions in the future. This survey can guide researchers and practitioners understand the research progress in these areas.  ( 2 min )
    Training Differentially Private Graph Neural Networks with Random Walk Sampling. (arXiv:2301.00738v1 [cs.LG])
    Deep learning models are known to put the privacy of their training data at risk, which poses challenges for their safe and ethical release to the public. Differentially private stochastic gradient descent is the de facto standard for training neural networks without leaking sensitive information about the training data. However, applying it to models for graph-structured data poses a novel challenge: unlike with i.i.d. data, sensitive information about a node in a graph cannot only leak through its gradients, but also through the gradients of all nodes within a larger neighborhood. In practice, this limits privacy-preserving deep learning on graphs to very shallow graph neural networks. We propose to solve this issue by training graph neural networks on disjoint subgraphs of a given training graph. We develop three random-walk-based methods for generating such disjoint subgraphs and perform a careful analysis of the data-generating distributions to provide strong privacy guarantees. Through extensive experiments, we show that our method greatly outperforms the state-of-the-art baseline on three large graphs, and matches or outperforms it on four smaller ones.  ( 2 min )
    Robust Consensus Clustering and its Applications for Advertising Forecasting. (arXiv:2301.00717v1 [cs.LG])
    Consensus clustering aggregates partitions in order to find a better fit by reconciling clustering results from different sources/executions. In practice, there exist noise and outliers in clustering task, which, however, may significantly degrade the performance. To address this issue, we propose a novel algorithm -- robust consensus clustering that can find common ground truth among experts' opinions, which tends to be minimally affected by the bias caused by the outliers. In particular, we formalize the robust consensus clustering problem as a constraint optimization problem, and then derive an effective algorithm upon alternating direction method of multipliers (ADMM) with rigorous convergence guarantee. Our method outperforms the baselines on benchmarks. We apply the proposed method to the real-world advertising campaign segmentation and forecasting tasks using the proposed consensus clustering results based on the similarity computed via Kolmogorov-Smirnov Statistics. The accurate clustering result is helpful for building the advertiser profiles so as to perform the forecasting.  ( 2 min )
    Product Ranking for Revenue Maximization with Multiple Purchases. (arXiv:2210.08268v3 [cs.LG] UPDATED)
    Product ranking is the core problem for revenue-maximizing online retailers. To design proper product ranking algorithms, various consumer choice models are proposed to characterize the consumers' behaviors when they are provided with a list of products. However, existing works assume that each consumer purchases at most one product or will keep viewing the product list after purchasing a product, which does not agree with the common practice in real scenarios. In this paper, we assume that each consumer can purchase multiple products at will. To model consumers' willingness to view and purchase, we set a random attention span and purchase budget, which determines the maximal amount of products that he/she views and purchases, respectively. Under this setting, we first design an optimal ranking policy when the online retailer can precisely model consumers' behaviors. Based on the policy, we further develop the Multiple-Purchase-with-Budget UCB (MPB-UCB) algorithms with $\~O(\sqrt{T})$ regret that estimate consumers' behaviors and maximize revenue simultaneously in online settings. Experiments on both synthetic and semi-synthetic datasets prove the effectiveness of the proposed algorithms.  ( 2 min )
    Ontology-based Context Aware Recommender System Application for Tourism. (arXiv:2301.00768v1 [cs.IR])
    In this work a novel recommender system (RS) for Tourism is presented. The RS is context aware as is now the rule in the state-of-the-art for recommender systems and works on top of a tourism ontology which is used to group the different items being offered. The presented RS mixes different types of recommenders creating an ensemble which changes on the basis of the RS's maturity. Starting from simple content-based recommendations and iteratively adding popularity, demographic and collaborative filtering methods as rating density and user cardinality increases. The result is a RS that mutates during its lifetime and uses a tourism ontology and natural language processing (NLP) to correctly bin the items to specific item categories and meta categories in the ontology. This item classification facilitates the association between user preferences and items, as well as allowing to better classify and group the items being offered, which in turn is particularly useful for context-aware filtering.  ( 2 min )
    Online Linearized LASSO. (arXiv:2211.06039v2 [stat.ML] UPDATED)
    Sparse regression has been a popular approach to perform variable selection and enhance the prediction accuracy and interpretability of the resulting statistical model. Existing approaches focus on offline regularized regression, while the online scenario has rarely been studied. In this paper, we propose a novel online sparse linear regression framework for analyzing streaming data when data points arrive sequentially. Our proposed method is memory efficient and requires less stringent restricted strong convexity assumptions. Theoretically, we show that with a properly chosen regularization parameter, the $\ell_2$-norm statistical error of our estimator diminishes to zero in the optimal order of $\tilde{O}({\sqrt{s/t}})$, where $s$ is the sparsity level, $t$ is the streaming sample size, and $\tilde{O}(\cdot)$ hides logarithmic terms. Numerical experiments demonstrate the practical efficiency of our algorithm.  ( 2 min )
    Detection of Groups with Biased Representation in Ranking. (arXiv:2301.00719v1 [cs.LG])
    Real-life tools for decision-making in many critical domains are based on ranking results. With the increasing awareness of algorithmic fairness, recent works have presented measures for fairness in ranking. Many of those definitions consider the representation of different ``protected groups'', in the top-$k$ ranked items, for any reasonable $k$. Given the protected groups, confirming algorithmic fairness is a simple task. However, the groups' definitions may be unknown in advance. In this paper, we study the problem of detecting groups with biased representation in the top-$k$ ranked items, eliminating the need to pre-define protected groups. The number of such groups possible can be exponential, making the problem hard. We propose efficient search algorithms for two different fairness measures: global representation bounds, and proportional representation. Then we propose a method to explain the bias in the representations of groups utilizing the notion of Shapley values. We conclude with an experimental study, showing the scalability of our approach and demonstrating the usefulness of the proposed algorithms.  ( 2 min )
    Attribute Inference Attacks in Online Multiplayer Video Games: a Case Study on Dota2. (arXiv:2210.09028v2 [cs.CR] UPDATED)
    Did you know that over 70 million of Dota2 players have their in-game data freely accessible? What if such data is used in malicious ways? This paper is the first to investigate such a problem. Motivated by the widespread popularity of video games, we propose the first threat model for Attribute Inference Attacks (AIA) in the Dota2 context. We explain how (and why) attackers can exploit the abundant public data in the Dota2 ecosystem to infer private information about its players. Due to lack of concrete evidence on the efficacy of our AIA, we empirically prove and assess their impact in reality. By conducting an extensive survey on $\sim$500 Dota2 players spanning over 26k matches, we verify whether a correlation exists between a player's Dota2 activity and their real-life. Then, after finding such a link ($p\!0.3$), we ethically perform diverse AIA. We leverage the capabilities of machine learning to infer real-life attributes of the respondents of our survey by using their publicly available in-game data. Our results show that, by applying domain expertise, some AIA can reach up to 98% precision and over 90% accuracy. This paper hence raises the alarm on a subtle, but concrete threat that can potentially affect the entire competitive gaming landscape. We alerted the developers of Dota2.  ( 2 min )
    Medical Diffusion: Denoising Diffusion Probabilistic Models for 3D Medical Image Generation. (arXiv:2211.03364v6 [eess.IV] UPDATED)
    Recent advances in computer vision have shown promising results in image generation. Diffusion probabilistic models in particular have generated realistic images from textual input, as demonstrated by DALL-E 2, Imagen and Stable Diffusion. However, their use in medicine, where image data typically comprises three-dimensional volumes, has not been systematically evaluated. Synthetic images may play a crucial role in privacy preserving artificial intelligence and can also be used to augment small datasets. Here we show that diffusion probabilistic models can synthesize high quality medical imaging data, which we show for Magnetic Resonance Images (MRI) and Computed Tomography (CT) images. We provide quantitative measurements of their performance through a reader study with two medical experts who rated the quality of the synthesized images in three categories: Realistic image appearance, anatomical correctness and consistency between slices. Furthermore, we demonstrate that synthetic images can be used in a self-supervised pre-training and improve the performance of breast segmentation models when data is scarce (dice score 0.91 vs. 0.95 without vs. with synthetic data).  ( 2 min )
    The Hypervolume Indicator Hessian Matrix: Analytical Expression, Computational Time Complexity, and Sparsity. (arXiv:2211.04171v3 [math.OC] UPDATED)
    The problem of approximating the Pareto front of a multiobjective optimization problem can be reformulated as the problem of finding a set that maximizes the hypervolume indicator. This paper establishes the analytical expression of the Hessian matrix of the mapping from a (fixed size) collection of $n$ points in the $d$-dimensional decision space (or $m$ dimensional objective space) to the scalar hypervolume indicator value. To define the Hessian matrix, the input set is vectorized, and the matrix is derived by analytical differentiation of the mapping from a vectorized set to the hypervolume indicator. The Hessian matrix plays a crucial role in second-order methods, such as the Newton-Raphson optimization method, and it can be used for the verification of local optimal sets. So far, the full analytical expression was only established and analyzed for the relatively simple bi-objective case. This paper will derive the full expression for arbitrary dimensions ($m\geq2$ objective functions). For the practically important three-dimensional case, we also provide an asymptotically efficient algorithm with time complexity in $O(n\log n)$ for the exact computation of the Hessian Matrix' non-zero entries. We establish a sharp bound of $12m-6$ for the number of non-zero entries. Also, for the general $m$-dimensional case, a compact recursive analytical expression is established, and its algorithmic implementation is discussed. Also, for the general case, some sparsity results can be established; these results are implied by the recursive expression. To validate and illustrate the analytically derived algorithms and results, we provide a few numerical examples using Python and Mathematica implementations. Open-source implementations of the algorithms and testing data are made available as a supplement to this paper.  ( 2 min )
    Pseudo AI Bias. (arXiv:2210.08141v2 [cs.AI] UPDATED)
    Pseudo Artificial Intelligence bias (PAIB) is broadly disseminated in the literature, which can result in unnecessary AI fear in society, exacerbate the enduring inequities and disparities in access to and sharing the benefits of AI applications, and waste social capital invested in AI research. This study systematically reviews publications in the literature to present three types of PAIBs identified due to: a) misunderstandings, b) pseudo mechanical bias, and c) over-expectations. We discussed the consequences of and solutions to PAIBs, including certifying users for AI applications to mitigate AI fears, providing customized user guidance for AI applications, and developing systematic approaches to monitor bias. We concluded that PAIB due to misunderstandings, pseudo mechanical bias, and over-expectations of algorithmic predictions is socially harmful.  ( 2 min )
    Temporally Layered Architecture for Adaptive, Distributed and Continuous Control. (arXiv:2301.00723v1 [cs.NE])
    We present temporally layered architecture (TLA), a biologically inspired system for temporally adaptive distributed control. TLA layers a fast and a slow controller together to achieve temporal abstraction that allows each layer to focus on a different time-scale. Our design is biologically inspired and draws on the architecture of the human brain which executes actions at different timescales depending on the environment's demands. Such distributed control design is widespread across biological systems because it increases survivability and accuracy in certain and uncertain environments. We demonstrate that TLA can provide many advantages over existing approaches, including persistent exploration, adaptive control, explainable temporal behavior, compute efficiency and distributed control. We present two different algorithms for training TLA: (a) Closed-loop control, where the fast controller is trained over a pre-trained slow controller, allowing better exploration for the fast controller and closed-loop control where the fast controller decides whether to "act-or-not" at each timestep; and (b) Partially open loop control, where the slow controller is trained over a pre-trained fast controller, allowing for open loop-control where the slow controller picks a temporally extended action or defers the next n-actions to the fast controller. We evaluated our method on a suite of continuous control tasks and demonstrate the advantages of TLA over several strong baselines.  ( 2 min )
    Federated Multi-Agent Deep Reinforcement Learning Approach via Physics-Informed Reward for Multi-Microgrid Energy Management. (arXiv:2301.00641v1 [eess.SY])
    The utilization of large-scale distributed renewable energy promotes the development of the multi-microgrid (MMG), which raises the need of developing an effective energy management method to minimize economic costs and keep self energy-sufficiency. The multi-agent deep reinforcement learning (MADRL) has been widely used for the energy management problem because of its real-time scheduling ability. However, its training requires massive energy operation data of microgrids (MGs), while gathering these data from different MGs would threaten their privacy and data security. Therefore, this paper tackles this practical yet challenging issue by proposing a federated multi-agent deep reinforcement learning (F-MADRL) algorithm via the physics-informed reward. In this algorithm, the federated learning (FL) mechanism is introduced to train the F-MADRL algorithm thus ensures the privacy and the security of data. In addition, a decentralized MMG model is built, and the energy of each participated MG is managed by an agent, which aims to minimize economic costs and keep self energy-sufficiency according to the physics-informed reward. At first, MGs individually execute the self-training based on local energy operation data to train their local agent models. Then, these local models are periodically uploaded to a server and their parameters are aggregated to build a global agent, which will be broadcasted to MGs and replace their local agents. In this way, the experience of each MG agent can be shared and the energy operation data is not explicitly transmitted, thus protecting the privacy and ensuring data security. Finally, experiments are conducted on Oak Ridge national laboratory distributed energy control communication lab microgrid (ORNL-MG) test system, and the comparisons are carried out to verify the effectiveness of introducing the FL mechanism and the outperformance of our proposed F-MADRL.  ( 3 min )
    IRT2: Inductive Linking and Ranking in Knowledge Graphs of Varying Scale. (arXiv:2301.00716v1 [cs.LG])
    We address the challenge of building domain-specific knowledge models for industrial use cases, where labelled data and taxonomic information is initially scarce. Our focus is on inductive link prediction models as a basis for practical tools that support knowledge engineers with exploring text collections and discovering and linking new (so-called open-world) entities to the knowledge graph. We argue that - though neural approaches to text mining have yielded impressive results in the past years - current benchmarks do not reflect the typical challenges encountered in the industrial wild properly. Therefore, our first contribution is an open benchmark coined IRT2 (inductive reasoning with text) that (1) covers knowledge graphs of varying sizes (including very small ones), (2) comes with incidental, low-quality text mentions, and (3) includes not only triple completion but also ranking, which is relevant for supporting experts with discovery tasks. We investigate two neural models for inductive link prediction, one based on end-to-end learning and one that learns from the knowledge graph and text data in separate steps. These models compete with a strong bag-of-words baseline. The results show a significant advance in performance for the neural approaches as soon as the available graph data decreases for linking. For ranking, the results are promising, and the neural approaches outperform the sparse retriever by a wide margin.  ( 2 min )
    E-commerce users' preferences for delivery options. (arXiv:2301.00666v1 [econ.GN])
    Many e-commerce marketplaces offer their users fast delivery options for free to meet the increasing needs of users, imposing an excessive burden on city logistics. Therefore, understanding e-commerce users' preference for delivery options is a key to designing logistics policies. To this end, this study designs a stated choice survey in which respondents are faced with choice tasks among different delivery options and time slots, which was completed by 4,062 users from the three major metropolitan areas in Japan. To analyze the data, mixed logit models capturing taste heterogeneity as well as flexible substitution patterns have been estimated. The model estimation results indicate that delivery attributes including fee, time, and time slot size are significant determinants of the delivery option choices. Associations between users' preferences and socio-demographic characteristics, such as age, gender, teleworking frequency and the presence of a delivery box, were also suggested. Moreover, we analyzed two willingness-to-pay measures for delivery, namely, the value of delivery time savings (VODT) and the value of time slot shortening (VOTS), and applied a non-semiparametric approach to estimate their distributions in a data-oriented manner. Although VODT has a large heterogeneity among respondents, the estimated median VODT is 25.6 JPY/day, implying that more than half of the respondents would wait an additional day if the delivery fee were increased by only 26 JPY, that is, they do not necessarily need a fast delivery option but often request it when cheap or almost free. Moreover, VOTS was found to be low, distributed with the median of 5.0 JPY/hour; that is, users do not highly value the reduction in time slot size in monetary terms. These findings on e-commerce users' preferences can help in designing levels of service for last-mile delivery to significantly improve its efficiency.  ( 2 min )
    New Designed Loss Functions to Solve Ordinary Differential Equations with Artificial Neural Network. (arXiv:2301.00636v1 [cs.LG])
    This paper investigates the use of artificial neural networks (ANNs) to solve differential equations (DEs) and the construction of the loss function which meets both differential equation and its initial/boundary condition of a certain DE. In section 2, the loss function is generalized to $n^\text{th}$ order ordinary differential equation(ODE). Other methods of construction are examined in Section 3 and applied to three different models to assess their effectiveness.  ( 2 min )
    Mixed moving average field guided learning for spatio-temporal data. (arXiv:2301.00736v1 [stat.ML])
    Influenced mixed moving average fields are a versatile modeling class for spatio-temporal data. However, their predictive distribution is not generally accessible. Under this modeling assumption, we define a novel theory-guided machine learning approach that employs a generalized Bayesian algorithm to make predictions. We employ a Lipschitz predictor, for example, a linear model or a feed-forward neural network, and determine a randomized estimator by minimizing a novel PAC Bayesian bound for data serially correlated along a spatial and temporal dimension. Performing causal future predictions is a highlight of our methodology as its potential application to data with short and long-range dependence. We conclude by showing the performance of the learning methodology in an example with linear predictors and simulated spatio-temporal data from an STOU process.  ( 2 min )
    Reinforcement Learning with Success Induced Task Prioritization. (arXiv:2301.00691v1 [cs.LG])
    Many challenging reinforcement learning (RL) problems require designing a distribution of tasks that can be applied to train effective policies. This distribution of tasks can be specified by the curriculum. A curriculum is meant to improve the results of learning and accelerate it. We introduce Success Induced Task Prioritization (SITP), a framework for automatic curriculum learning, where a task sequence is created based on the success rate of each task. In this setting, each task is an algorithmically created environment instance with a unique configuration. The algorithm selects the order of tasks that provide the fastest learning for agents. The probability of selecting any of the tasks for the next stage of learning is determined by evaluating its performance score in previous stages. Experiments were carried out in the Partially Observable Grid Environment for Multiple Agents (POGEMA) and Procgen benchmark. We demonstrate that SITP matches or surpasses the results of other curriculum design methods. Our method can be implemented with handful of minor modifications to any standard RL framework and provides useful prioritization with minimal computational overhead.  ( 2 min )
    Tsetlin Machine Embedding: Representing Words Using Logical Expressions. (arXiv:2301.00709v1 [cs.CL])
    Embedding words in vector space is a fundamental first step in state-of-the-art natural language processing (NLP). Typical NLP solutions employ pre-defined vector representations to improve generalization by co-locating similar words in vector space. For instance, Word2Vec is a self-supervised predictive model that captures the context of words using a neural network. Similarly, GLoVe is a popular unsupervised model incorporating corpus-wide word co-occurrence statistics. Such word embedding has significantly boosted important NLP tasks, including sentiment analysis, document classification, and machine translation. However, the embeddings are dense floating-point vectors, making them expensive to compute and difficult to interpret. In this paper, we instead propose to represent the semantics of words with a few defining words that are related using propositional logic. To produce such logical embeddings, we introduce a Tsetlin Machine-based autoencoder that learns logical clauses self-supervised. The clauses consist of contextual words like "black," "cup," and "hot" to define other words like "coffee," thus being human-understandable. We evaluate our embedding approach on several intrinsic and extrinsic benchmarks, outperforming GLoVe on six classification tasks. Furthermore, we investigate the interpretability of our embedding using the logical representations acquired during training. We also visualize word clusters in vector space, demonstrating how our logical embedding co-locate similar words.  ( 2 min )
    A RL-based Policy Optimization Method Guided by Adaptive Stability Certification. (arXiv:2301.00521v1 [cs.RO])
    In contrast to the control-theoretic methods, the lack of stability guarantee remains a significant problem for model-free reinforcement learning (RL) methods. Jointly learning a policy and a Lyapunov function has recently become a promising approach to ensuring the whole system with a stability guarantee. However, the classical Lyapunov constraints researchers introduced cannot stabilize the system during the sampling-based optimization. Therefore, we propose the Adaptive Stability Certification (ASC), making the system reach sampling-based stability. Because the ASC condition can search for the optimal policy heuristically, we design the Adaptive Lyapunov-based Actor-Critic (ALAC) algorithm based on the ASC condition. Meanwhile, our algorithm avoids the optimization problem that a variety of constraints are coupled into the objective in current approaches. When evaluated on ten robotic tasks, our method achieves lower accumulated cost and fewer stability constraint violations than previous studies.  ( 2 min )
    Federated Learning with Client-Exclusive Classes. (arXiv:2301.00489v1 [cs.LG])
    Existing federated classification algorithms typically assume the local annotations at every client cover the same set of classes. In this paper, we aim to lift such an assumption and focus on a more general yet practical non-IID setting where every client can work on non-identical and even disjoint sets of classes (i.e., client-exclusive classes), and the clients have a common goal which is to build a global classification model to identify the union of these classes. Such heterogeneity in client class sets poses a new challenge: how to ensure different clients are operating in the same latent space so as to avoid the drift after aggregation? We observe that the classes can be described in natural languages (i.e., class names) and these names are typically safe to share with all parties. Thus, we formulate the classification problem as a matching process between data representations and class representations and break the classification model into a data encoder and a label encoder. We leverage the natural-language class names as the common ground to anchor the class representations in the label encoder. In each iteration, the label encoder updates the class representations and regulates the data representations through matching. We further use the updated class representations at each round to annotate data samples for locally-unaware classes according to similarity and distill knowledge to local models. Extensive experiments on four real-world datasets show that the proposed method can outperform various classical and state-of-the-art federated learning methods designed for learning with non-IID data.  ( 2 min )
    Human-in-the-loop Embodied Intelligence with Interactive Simulation Environment for Surgical Robot Learning. (arXiv:2301.00452v1 [cs.RO])
    Surgical robot automation has attracted increasing research interest over the past decade, expecting its huge potential to benefit surgeons, nurses and patients. Recently, the learning paradigm of embodied AI has demonstrated promising ability to learn good control policies for various complex tasks, where embodied AI simulators play an essential role to facilitate relevant researchers. However, existing open-sourced simulators for surgical robot are still not sufficiently supporting human interactions through physical input devices, which further limits effective investigations on how human demonstrations would affect policy learning. In this paper, we study human-in-the-loop embodied intelligence with a new interactive simulation platform for surgical robot learning. Specifically, we establish our platform based on our previously released SurRoL simulator with several new features co-developed to allow high-quality human interaction via an input device. With these, we further propose to collect human demonstrations and imitate the action patterns to achieve more effective policy learning. We showcase the improvement of our simulation environment with the designed new features and tasks, and validate state-of-the-art reinforcement learning algorithms using the interactive environment. Promising results are obtained, with which we hope to pave the way for future research on surgical embodied intelligence. Our platform is released and will be continuously updated in the website: https://med-air.github.io/SurRoL/  ( 2 min )
    A principled distributional approach to trajectory similarity measurement. (arXiv:2301.00393v1 [cs.LG])
    Existing measures and representations for trajectories have two longstanding fundamental shortcomings, i.e., they are computationally expensive and they can not guarantee the `uniqueness' property of a distance function: dist(X,Y) = 0 if and only if X=Y, where $X$ and $Y$ are two trajectories. This paper proposes a simple yet powerful way to represent trajectories and measure the similarity between two trajectories using a distributional kernel to address these shortcomings. It is a principled approach based on kernel mean embedding which has a strong theoretical underpinning. It has three distinctive features in comparison with existing approaches. (1) A distributional kernel is used for the very first time for trajectory representation and similarity measurement. (2) It does not rely on point-to-point distances which are used in most existing distances for trajectories. (3) It requires no learning, unlike existing learning and deep learning approaches. We show the generality of this new approach in three applications: (a) trajectory anomaly detection, (b) anomalous sub-trajectory detection, and (c) trajectory pattern mining. We identify that the distributional kernel has (i) a unique data-dependent property and the above uniqueness property which are the key factors that lead to its superior task-specific performance; and (ii) runtime orders of magnitude faster than existing distance measures.  ( 2 min )
    Hierarchical Explanations for Video Action Recognition. (arXiv:2301.00436v1 [cs.CV])
    We propose Hierarchical ProtoPNet: an interpretable network that explains its reasoning process by considering the hierarchical relationship between classes. Different from previous methods that explain their reasoning process by dissecting the input image and finding the prototypical parts responsible for the classification, we propose to explain the reasoning process for video action classification by dissecting the input video frames on multiple levels of the class hierarchy. The explanations leverage the hierarchy to deal with uncertainty, akin to human reasoning: When we observe water and human activity, but no definitive action it can be recognized as the water sports parent class. Only after observing a person swimming can we definitively refine it to the swimming action. Experiments on ActivityNet and UCF-101 show performance improvements while providing multi-level explanations.  ( 2 min )
    Efficient Online Learning with Memory via Frank-Wolfe Optimization: Algorithms with Bounded Dynamic Regret and Applications to Control. (arXiv:2301.00497v1 [cs.LG])
    Projection operations are a typical computation bottleneck in online learning. In this paper, we enable projection-free online learning within the framework of Online Convex Optimization with Memory (OCO-M) -- OCO-M captures how the history of decisions affects the current outcome by allowing the online learning loss functions to depend on both current and past decisions. Particularly, we introduce the first projection-free meta-base learning algorithm with memory that minimizes dynamic regret, i.e., that minimizes the suboptimality against any sequence of time-varying decisions. We are motivated by artificial intelligence applications where autonomous agents need to adapt to time-varying environments in real-time, accounting for how past decisions affect the present. Examples of such applications are: online control of dynamical systems; statistical arbitrage; and time series prediction. The algorithm builds on the Online Frank-Wolfe (OFW) and Hedge algorithms. We demonstrate how our algorithm can be applied to the online control of linear time-varying systems in the presence of unpredictable process noise. To this end, we develop the first controller with memory and bounded dynamic regret against any optimal time-varying linear feedback control policy. We validate our algorithm in simulated scenarios of online control of linear time-invariant systems.  ( 2 min )
    FedICT: Federated Multi-task Distillation for Multi-access Edge Computing. (arXiv:2301.00389v1 [cs.LG])
    The growing interest in intelligent services and privacy protection for mobile devices has given rise to the widespread application of federated learning in Multi-access Edge Computing (MEC). Diverse user behaviors call for personalized services with heterogeneous Machine Learning (ML) models on different devices. Federated Multi-task Learning (FMTL) is proposed to train related but personalized ML models for different devices, whereas previous works suffer from excessive communication overhead during training and neglect the model heterogeneity among devices in MEC. Introducing knowledge distillation into FMTL can simultaneously enable efficient communication and model heterogeneity among clients, whereas existing methods rely on a public dataset, which is impractical in reality. To tackle this dilemma, Federated MultI-task Distillation for Multi-access Edge CompuTing (FedICT) is proposed. FedICT direct local-global knowledge aloof during bi-directional distillation processes between clients and the server, aiming to enable multi-task clients while alleviating client drift derived from divergent optimization directions of client-side local models. Specifically, FedICT includes Federated Prior Knowledge Distillation (FPKD) and Local Knowledge Adjustment (LKA). FPKD is proposed to reinforce the clients' fitting of local data by introducing prior knowledge of local data distributions. Moreover, LKA is proposed to correct the distillation loss of the server, making the transferred local knowledge better match the generalized representation. Experiments on three datasets show that FedICT significantly outperforms all compared benchmarks in various data heterogeneous and model architecture settings, achieving improved accuracy with less than 1.2% training communication overhead compared with FedAvg and no more than 75% training communication round compared with FedGKT.  ( 2 min )
    On the Challenges of using Reinforcement Learning in Precision Drug Dosing: Delay and Prolongedness of Action Effects. (arXiv:2301.00512v1 [cs.LG])
    Drug dosing is an important application of AI, which can be formulated as a Reinforcement Learning (RL) problem. In this paper, we identify two major challenges of using RL for drug dosing: delayed and prolonged effects of administering medications, which break the Markov assumption of the RL framework. We focus on prolongedness and define PAE-POMDP (Prolonged Action Effect-Partially Observable Markov Decision Process), a subclass of POMDPs in which the Markov assumption does not hold specifically due to prolonged effects of actions. Motivated by the pharmacology literature, we propose a simple and effective approach to converting drug dosing PAE-POMDPs into MDPs, enabling the use of the existing RL algorithms to solve such problems. We validate the proposed approach on a toy task, and a challenging glucose control task, for which we devise a clinically-inspired reward function. Our results demonstrate that: (1) the proposed method to restore the Markov assumption leads to significant improvements over a vanilla baseline; (2) the approach is competitive with recurrent policies which may inherently capture the prolonged effect of actions; (3) it is remarkably more time and memory efficient than the recurrent baseline and hence more suitable for real-time dosing control systems; and (4) it exhibits favorable qualitative behavior in our policy analysis.  ( 2 min )
    Neural Collapse in Deep Linear Network: From Balanced to Imbalanced Data. (arXiv:2301.00437v1 [cs.LG])
    Modern deep neural networks have achieved superhuman performance in tasks from image classification to game play. Surprisingly, these various complex systems with massive amounts of parameters exhibit the same remarkable structural properties in their last-layer features and classifiers across canonical datasets. This phenomenon is known as "Neural Collapse," and it was discovered empirically by Papyan et al. \cite{Papyan20}. Recent papers have theoretically shown the global solutions to the training network problem under a simplified "unconstrained feature model" exhibiting this phenomenon. We take a step further and prove the Neural Collapse occurrence for deep linear network for the popular mean squared error (MSE) and cross entropy (CE) loss. Furthermore, we extend our research to imbalanced data for MSE loss and present the first geometric analysis for Neural Collapse under this setting.  ( 2 min )
    CORGI-PM: A Chinese Corpus For Gender Bias Probing and Mitigation. (arXiv:2301.00395v1 [cs.CL])
    As natural language processing (NLP) for gender bias becomes a significant interdisciplinary topic, the prevalent data-driven techniques such as large-scale language models suffer from data inadequacy and biased corpus, especially for languages with insufficient resources such as Chinese. To this end, we propose a Chinese cOrpus foR Gender bIas Probing and Mitigation CORGI-PM, which contains 32.9k sentences with high-quality labels derived by following an annotation scheme specifically developed for gender bias in the Chinese context. Moreover, we address three challenges for automatic textual gender bias mitigation, which requires the models to detect, classify, and mitigate textual gender bias. We also conduct experiments with state-of-the-art language models to provide baselines. To our best knowledge, CORGI-PM is the first sentence-level Chinese corpus for gender bias probing and mitigation.  ( 2 min )
    A Concept Knowledge Graph for User Next Intent Prediction at Alipay. (arXiv:2301.00503v1 [cs.CL])
    This paper illustrates the technologies of user next intent prediction with a concept knowledge graph. The system has been deployed on the Web at Alipay, serving more than 100 million daily active users. Specifically, we propose AlipayKG to explicitly characterize user intent, which is an offline concept knowledge graph in the Life-Service domain modeling the historical behaviors of users, the rich content interacted by users and the relations between them. We further introduce a Transformer-based model which integrates expert rules from the knowledge graph to infer the online user's next intent. Experimental results demonstrate that the proposed system can effectively enhance the performance of the downstream tasks while retaining explainability.  ( 2 min )
    Conditional Diffusion Based on Discrete Graph Structures for Molecular Graph Generation. (arXiv:2301.00427v1 [cs.LG])
    Learning the underlying distribution of molecular graphs and generating high-fidelity samples is a fundamental research problem in drug discovery and material science. However, accurately modeling distribution and rapidly generating novel molecular graphs remain crucial and challenging goals. To accomplish these goals, we propose a novel Conditional Diffusion model based on discrete Graph Structures (CDGS) for molecular graph generation. Specifically, we construct a forward graph diffusion process on both graph structures and inherent features through stochastic differential equations (SDE) and derive discrete graph structures as the condition for reverse generative processes. We present a specialized hybrid graph noise prediction model that extracts the global context and the local node-edge dependency from intermediate graph states. We further utilize ordinary differential equation (ODE) solvers for efficient graph sampling, based on the semi-linear structure of the probability flow ODE. Experiments on diverse datasets validate the effectiveness of our framework. Particularly, the proposed method still generates high-quality molecular graphs in a limited number of steps.  ( 2 min )
    MIGPerf: A Comprehensive Benchmark for Deep Learning Training and Inference Workloads on Multi-Instance GPUs. (arXiv:2301.00407v1 [cs.LG])
    New architecture GPUs like A100 are now equipped with multi-instance GPU (MIG) technology, which allows the GPU to be partitioned into multiple small, isolated instances. This technology provides more flexibility for users to support both deep learning training and inference workloads, but efficiently utilizing it can still be challenging. The vision of this paper is to provide a more comprehensive and practical benchmark study for MIG in order to eliminate the need for tedious manual benchmarking and tuning efforts. To achieve this vision, the paper presents MIGPerf, an open-source tool that streamlines the benchmark study for MIG. Using MIGPerf, the authors conduct a series of experiments, including deep learning training and inference characterization on MIG, GPU sharing characterization, and framework compatibility with MIG. The results of these experiments provide new insights and guidance for users to effectively employ MIG, and lay the foundation for further research on the orchestration of hybrid training and inference workloads on MIGs. The code and results are released on https://github.com/MLSysOps/MIGProfiler. This work is still in progress and more results will be published soon.  ( 2 min )
    Unsupervised Acoustic Scene Mapping Based on Acoustic Features and Dimensionality Reduction. (arXiv:2301.00448v1 [eess.AS])
    Classical methods for acoustic scene mapping require the estimation of time difference of arrival (TDOA) between microphones. Unfortunately, TDOA estimation is very sensitive to reverberation and additive noise. We introduce an unsupervised data-driven approach that exploits the natural structure of the data. Our method builds upon local conformal autoencoders (LOCA) - an offline deep learning scheme for learning standardized data coordinates from measurements. Our experimental setup includes a microphone array that measures the transmitted sound source at multiple locations across the acoustic enclosure. We demonstrate that LOCA learns a representation that is isometric to the spatial locations of the microphones. The performance of our method is evaluated using a series of realistic simulations and compared with other dimensionality-reduction schemes. We further assess the influence of reverberation on the results of LOCA and show that it demonstrates considerable robustness.  ( 2 min )
    Mapping smallholder cashew plantations to inform sustainable tree crop expansion in Benin. (arXiv:2301.00363v1 [cs.CV])
    Cashews are grown by over 3 million smallholders in more than 40 countries worldwide as a principal source of income. As the third largest cashew producer in Africa, Benin has nearly 200,000 smallholder cashew growers contributing 15% of the country's national export earnings. However, a lack of information on where and how cashew trees grow across the country hinders decision-making that could support increased cashew production and poverty alleviation. By leveraging 2.4-m Planet Basemaps and 0.5-m aerial imagery, newly developed deep learning algorithms, and large-scale ground truth datasets, we successfully produced the first national map of cashew in Benin and characterized the expansion of cashew plantations between 2015 and 2021. In particular, we developed a SpatioTemporal Classification with Attention (STCA) model to map the distribution of cashew plantations, which can fully capture texture information from discriminative time steps during a growing season. We further developed a Clustering Augmented Self-supervised Temporal Classification (CASTC) model to distinguish high-density versus low-density cashew plantations by automatic feature extraction and optimized clustering. Results show that the STCA model has an overall accuracy of 80% and the CASTC model achieved an overall accuracy of 77.9%. We found that the cashew area in Benin has doubled from 2015 to 2021 with 60% of new plantation development coming from cropland or fallow land, while encroachment of cashew plantations into protected areas has increased by 70%. Only half of cashew plantations were high-density in 2021, suggesting high potential for intensification. Our study illustrates the power of combining high-resolution remote sensing imagery and state-of-the-art deep learning algorithms to better understand tree crops in the heterogeneous smallholder landscape.  ( 2 min )
    Image To Tree with Recursive Prompting. (arXiv:2301.00447v1 [cs.CV])
    Extracting complex structures from grid-based data is a common key step in automated medical image analysis. The conventional solution to recovering tree-structured geometries typically involves computing the minimal cost path through intermediate representations derived from segmentation masks. However, this methodology has significant limitations in the context of projective imaging of tree-structured 3D anatomical data such as coronary arteries, since there are often overlapping branches in the 2D projection. In this work, we propose a novel approach to predicting tree connectivity structure which reformulates the task as an optimization problem over individual steps of a recursive process. We design and train a two-stage model which leverages the UNet and Transformer architectures and introduces an image-based prompting technique. Our proposed method achieves compelling results on a pair of synthetic datasets, and outperforms a shortest-path baseline.  ( 2 min )
    GANExplainer: GAN-based Graph Neural Networks Explainer. (arXiv:2301.00012v1 [cs.LG])
    With the rapid deployment of graph neural networks (GNNs) based techniques into a wide range of applications such as link prediction, node classification, and graph classification the explainability of GNNs has become an indispensable component for predictive and trustworthy decision-making. Thus, it is critical to explain why graph neural network (GNN) makes particular predictions for them to be believed in many applications. Some GNNs explainers have been proposed recently. However, they lack to generate accurate and real explanations. To mitigate these limitations, we propose GANExplainer, based on Generative Adversarial Network (GAN) architecture. GANExplainer is composed of a generator to create explanations and a discriminator to assist with the Generator development. We investigate the explanation accuracy of our models by comparing the performance of GANExplainer with other state-of-the-art methods. Our empirical results on synthetic datasets indicate that GANExplainer improves explanation accuracy by up to 35\% compared to its alternatives.  ( 2 min )
    SESNet: sequence-structure feature-integrated deep learning method for data-efficient protein engineering. (arXiv:2301.00004v1 [q-bio.QM])
    Deep learning has been widely used for protein engineering. However, it is limited by the lack of sufficient experimental data to train an accurate model for predicting the functional fitness of high-order mutants. Here, we develop SESNet, a supervised deep-learning model to predict the fitness for protein mutants by leveraging both sequence and structure information, and exploiting attention mechanism. Our model integrates local evolutionary context from homologous sequences, the global evolutionary context encoding rich semantic from the universal protein sequence space and the structure information accounting for the microenvironment around each residue in a protein. We show that SESNet outperforms state-of-the-art models for predicting the sequence-function relationship on 26 deep mutational scanning datasets. More importantly, we propose a data augmentation strategy by leveraging the data from unsupervised models to pre-train our model. After that, our model can achieve strikingly high accuracy in prediction of the fitness of protein mutants, especially for the higher order variants (> 4 mutation sites), when finetuned by using only a small number of experimental mutation data (<50). The strategy proposed is of great practical value as the required experimental effort, i.e., producing a few tens of experimental mutation data on a given protein, is generally affordable by an ordinary biochemical group and can be applied on almost any protein.  ( 2 min )
    Discriminative Radial Domain Adaptation. (arXiv:2301.00383v1 [cs.LG])
    Domain adaptation methods reduce domain shift typically by learning domain-invariant features. Most existing methods are built on distribution matching, e.g., adversarial domain adaptation, which tends to corrupt feature discriminability. In this paper, we propose Discriminative Radial Domain Adaptation (DRDR) which bridges source and target domains via a shared radial structure. It's motivated by the observation that as the model is trained to be progressively discriminative, features of different categories expand outwards in different directions, forming a radial structure. We show that transferring such an inherently discriminative structure would enable to enhance feature transferability and discriminability simultaneously. Specifically, we represent each domain with a global anchor and each category a local anchor to form a radial structure and reduce domain shift via structure matching. It consists of two parts, namely isometric transformation to align the structure globally and local refinement to match each category. To enhance the discriminability of the structure, we further encourage samples to cluster close to the corresponding local anchors based on optimal-transport assignment. Extensively experimenting on multiple benchmarks, our method is shown to consistently outperforms state-of-the-art approaches on varied tasks, including the typical unsupervised domain adaptation, multi-source domain adaptation, domain-agnostic learning, and domain generalization.  ( 2 min )
    PiPAD: Pipelined and Parallel Dynamic GNN Training on GPUs. (arXiv:2301.00391v1 [cs.LG])
    Dynamic Graph Neural Networks (DGNNs) have been broadly applied in various real-life applications, such as link prediction and pandemic forecast, to capture both static structural information and temporal characteristics from dynamic graphs. Combining both time-dependent and -independent components, DGNNs manifest substantial parallel computation and data reuse potentials, but suffer from severe memory access inefficiency and data transfer overhead under the canonical one-graph-at-a-time training pattern. To tackle the challenges, we propose PiPAD, a $\underline{\textbf{Pi}}pelined$ and $\underline{\textbf{PA}}rallel$ $\underline{\textbf{D}}GNN$ training framework for the end-to-end performance optimization on GPUs. From both the algorithm and runtime level, PiPAD holistically reconstructs the overall training paradigm from the data organization to computation manner. Capable of processing multiple graph snapshots in parallel, PiPAD eliminates the unnecessary data transmission and alleviates memory access inefficiency to improve the overall performance. Our evaluation across various datasets shows PiPAD achieves $1.22\times$-$9.57\times$ speedup over the state-of-the-art DGNN frameworks on three representative models.  ( 2 min )
    Recovering Top-Two Answers and Confusion Probability in Multi-Choice Crowdsourcing. (arXiv:2301.00006v1 [cs.HC])
    Crowdsourcing has emerged as an effective platform to label a large volume of data in a cost- and time-efficient manner. Most previous works have focused on designing an efficient algorithm to recover only the ground-truth labels of the data. In this paper, we consider multi-choice crowdsourced labeling with the goal of recovering not only the ground truth but also the most confusing answer and the confusion probability. The most confusing answer provides useful information about the task by revealing the most plausible answer other than the ground truth and how plausible it is. To theoretically analyze such scenarios, we propose a model where there are top-two plausible answers for each task, distinguished from the rest of choices. Task difficulty is quantified by the confusion probability between the top two, and worker reliability is quantified by the probability of giving an answer among the top two. Under this model, we propose a two-stage inference algorithm to infer the top-two answers as well as the confusion probability. We show that our algorithm achieves the minimax optimal convergence rate. We conduct both synthetic and real-data experiments and demonstrate that our algorithm outperforms other recent algorithms. We also show the applicability of our algorithms in inferring the difficulty of tasks and training neural networks with the soft labels composed of the top-two most plausible classes.  ( 2 min )
    Goal-guided Transformer-enabled Reinforcement Learning for Efficient Autonomous Navigation. (arXiv:2301.00362v1 [cs.RO])
    Despite some successful applications of goal-driven navigation, existing deep reinforcement learning-based approaches notoriously suffers from poor data efficiency issue. One of the reasons is that the goal information is decoupled from the perception module and directly introduced as a condition of decision-making, resulting in the goal-irrelevant features of the scene representation playing an adversary role during the learning process. In light of this, we present a novel Goal-guided Transformer-enabled reinforcement learning (GTRL) approach by considering the physical goal states as an input of the scene encoder for guiding the scene representation to couple with the goal information and realizing efficient autonomous navigation. More specifically, we propose a novel variant of the Vision Transformer as the backbone of the perception system, namely Goal-guided Transformer (GoT), and pre-train it with expert priors to boost the data efficiency. Subsequently, a reinforcement learning algorithm is instantiated for the decision-making system, taking the goal-oriented scene representation from the GoT as the input and generating decision commands. As a result, our approach motivates the scene representation to concentrate mainly on goal-relevant features, which substantially enhances the data efficiency of the DRL learning process, leading to superior navigation performance. Both simulation and real-world experimental results manifest the superiority of our approach in terms of data efficiency, performance, robustness, and sim-to-real generalization, compared with other state-of-art baselines. Demonstration videos are available at \colorb{https://youtu.be/93LGlGvaN0c.  ( 2 min )
    Correlation Clustering Algorithm for Dynamic Complete Signed Graphs: An Index-based Approach. (arXiv:2301.00384v1 [cs.DS])
    In this paper, we reduce the complexity of approximating the correlation clustering problem from $O(m\times\left( 2+ \alpha (G) \right)+n)$ to $O(m+n)$ for any given value of $\varepsilon$ for a complete signed graph with $n$ vertices and $m$ positive edges where $\alpha(G)$ is the arboricity of the graph. Our approach gives the same output as the original algorithm and makes it possible to implement the algorithm in a full dynamic setting where edge sign flipping and vertex addition/removal are allowed. Constructing this index costs $O(m)$ memory and $O(m\times\alpha(G))$ time. We also studied the structural properties of the non-agreement measure used in the approximation algorithm. The theoretical results are accompanied by a full set of experiments concerning seven real-world graphs. These results shows superiority of our index-based algorithm to the non-index one by a decrease of %34 in time on average.  ( 2 min )
    A Global Optimization Algorithm for K-Center Clustering of One Billion Samples. (arXiv:2301.00061v1 [math.OC])
    This paper presents a practical global optimization algorithm for the K-center clustering problem, which aims to select K samples as the cluster centers to minimize the maximum within-cluster distance. This algorithm is based on a reduced-space branch and bound scheme and guarantees convergence to the global optimum in a finite number of steps by only branching on the regions of centers. To improve efficiency, we have designed a two-stage decomposable lower bound, the solution of which can be derived in a closed form. In addition, we also propose several acceleration techniques to narrow down the region of centers, including bounds tightening, sample reduction, and parallelization. Extensive studies on synthetic and real-world datasets have demonstrated that our algorithm can solve the K-center problems to global optimal within 4 hours for ten million samples in the serial mode and one billion samples in the parallel mode. Moreover, compared with the state-of-the-art heuristic methods, the global optimum obtained by our algorithm can averagely reduce the objective function by 25.8% on all the synthetic and real-world datasets.  ( 2 min )
    Skew Class-balanced Re-weighting for Unbiased Scene Graph Generation. (arXiv:2301.00351v1 [cs.LG])
    An unbiased scene graph generation (SGG) algorithm referred to as Skew Class-balanced Re-weighting (SCR) is proposed for considering the unbiased predicate prediction caused by the long-tailed distribution. The prior works focus mainly on alleviating the deteriorating performances of the minority predicate predictions, showing drastic dropping recall scores, i.e., losing the majority predicate performances. It has not yet correctly analyzed the trade-off between majority and minority predicate performances in the limited SGG datasets. In this paper, to alleviate the issue, the Skew Class-balanced Re-weighting (SCR) loss function is considered for the unbiased SGG models. Leveraged by the skewness of biased predicate predictions, the SCR estimates the target predicate weight coefficient and then re-weights more to the biased predicates for better trading-off between the majority predicates and the minority ones. Extensive experiments conducted on the standard Visual Genome dataset and Open Image V4 \& V6 show the performances and generality of the SCR with the traditional SGG models.  ( 2 min )
    Self-Supervised Object Segmentation with a Cut-and-Pasting GAN. (arXiv:2301.00366v1 [cs.CV])
    This paper proposes a novel self-supervised based Cut-and-Paste GAN to perform foreground object segmentation and generate realistic composite images without manual annotations. We accomplish this goal by a simple yet effective self-supervised approach coupled with the U-Net based discriminator. The proposed method extends the ability of the standard discriminators to learn not only the global data representations via classification (real/fake) but also learn semantic and structural information through pseudo labels created using the self-supervised task. The proposed method empowers the generator to create meaningful masks by forcing it to learn informative per-pixel as well as global image feedback from the discriminator. Our experiments demonstrate that our proposed method significantly outperforms the state-of-the-art methods on the standard benchmark datasets.  ( 2 min )
    Theoretical Characterization of How Neural Network Pruning Affects its Generalization. (arXiv:2301.00335v1 [cs.LG])
    It has been observed in practice that applying pruning-at-initialization methods to neural networks and training the sparsified networks can not only retain the testing performance of the original dense models, but also sometimes even slightly boost the generalization performance. Theoretical understanding for such experimental observations are yet to be developed. This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization. Specifically, this work considers a classification task for overparameterized two-layer neural networks, where the network is randomly pruned according to different rates at the initialization. It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero and the network exhibits good generalization performance. More surprisingly, the generalization bound gets better as the pruning fraction gets larger. To complement this positive result, this work further shows a negative result: there exists a large pruning fraction such that while gradient descent is still able to drive the training loss toward zero (by memorizing noise), the generalization performance is no better than random guessing. This further suggests that pruning can change the feature learning process, which leads to the performance drop of the pruned neural network. Up to our knowledge, this is the \textbf{first} generalization result for pruned neural networks, suggesting that pruning can improve the neural network's generalization.  ( 2 min )
    Exploring Singularities in point clouds with the graph Laplacian: An explicit approach. (arXiv:2301.00201v1 [stat.ML])
    We develop theory and methods that use the graph Laplacian to analyze the geometry of the underlying manifold of point clouds. Our theory provides theoretical guarantees and explicit bounds on the functional form of the graph Laplacian, in the case when it acts on functions defined close to singularities of the underlying manifold. We also propose methods that can be used to estimate these geometric properties of the point cloud, which are based on the theoretical guarantees.  ( 2 min )
    A Functional approach for Two Way Dimension Reduction in Time Series. (arXiv:2301.00357v1 [cs.LG])
    The rise in data has led to the need for dimension reduction techniques, especially in the area of non-scalar variables, including time series, natural language processing, and computer vision. In this paper, we specifically investigate dimension reduction for time series through functional data analysis. Current methods for dimension reduction in functional data are functional principal component analysis and functional autoencoders, which are limited to linear mappings or scalar representations for the time series, which is inefficient. In real data applications, the nature of the data is much more complex. We propose a non-linear function-on-function approach, which consists of a functional encoder and a functional decoder, that uses continuous hidden layers consisting of continuous neurons to learn the structure inherent in functional data, which addresses the aforementioned concerns in the existing approaches. Our approach gives a low dimension latent representation by reducing the number of functional features as well as the timepoints at which the functions are observed. The effectiveness of the proposed model is demonstrated through multiple simulations and real data examples.  ( 2 min )
    Generalized PTR: User-Friendly Recipes for Data-Adaptive Algorithms with Differential Privacy. (arXiv:2301.00301v1 [cs.LG])
    The ''Propose-Test-Release'' (PTR) framework is a classic recipe for designing differentially private (DP) algorithms that are data-adaptive, i.e. those that add less noise when the input dataset is nice. We extend PTR to a more general setting by privately testing data-dependent privacy losses rather than local sensitivity, hence making it applicable beyond the standard noise-adding mechanisms, e.g. to queries with unbounded or undefined sensitivity. We demonstrate the versatility of generalized PTR using private linear regression as a case study. Additionally, we apply our algorithm to solve an open problem from ''Private Aggregation of Teacher Ensembles (PATE)'' -- privately releasing the entire model with a delicate data-dependent analysis.  ( 2 min )
    Sharper analysis of sparsely activated wide neural networks with trainable biases. (arXiv:2301.00327v1 [cs.LG])
    This work studies training one-hidden-layer overparameterized ReLU networks via gradient descent in the neural tangent kernel (NTK) regime, where, differently from the previous works, the networks' biases are trainable and are initialized to some constant rather than zero. The first set of results of this work characterize the convergence of the network's gradient descent dynamics. Surprisingly, it is shown that the network after sparsification can achieve as fast convergence as the original network. The contribution over previous work is that not only the bias is allowed to be updated by gradient descent under our setting but also a finer analysis is given such that the required width to ensure the network's closeness to its NTK is improved. Secondly, the networks' generalization bound after training is provided. A width-sparsity dependence is presented which yields sparsity-dependent localized Rademacher complexity and a generalization bound matching previous analysis (up to logarithmic factors). As a by-product, if the bias initialization is chosen to be zero, the width requirement improves the previous bound for the shallow networks' generalization. Lastly, since the generalization bound has dependence on the smallest eigenvalue of the limiting NTK and the bounds from previous works yield vacuous generalization, this work further studies the least eigenvalue of the limiting NTK. Surprisingly, while it is not shown that trainable biases are necessary, trainable bias helps to identify a nice data-dependent region where a much finer analysis of the NTK's smallest eigenvalue can be conducted, which leads to a much sharper lower bound than the previously known worst-case bound and, consequently, a non-vacuous generalization bound.  ( 2 min )
    Causal Deep Learning: Causal Capsules and Tensor Transformers. (arXiv:2301.00314v1 [cs.LG])
    We derive a set of causal deep neural networks whose architectures are a consequence of tensor (multilinear) factor analysis. Forward causal questions are addressed with a neural network architecture composed of causal capsules and a tensor transformer. The former estimate a set of latent variables that represent the causal factors, and the latter governs their interaction. Causal capsules and tensor transformers may be implemented using shallow autoencoders, but for a scalable architecture we employ block algebra and derive a deep neural network composed of a hierarchy of autoencoders. An interleaved kernel hierarchy preprocesses the data resulting in a hierarchy of kernel tensor factor models. Inverse causal questions are addressed with a neural network that implements multilinear projection and estimates the causes of effects. As an alternative to aggressive bottleneck dimension reduction or regularized regression that may camouflage an inherently underdetermined inverse problem, we prescribe modeling different aspects of the mechanism of data formation with piecewise tensor models whose multilinear projections are well-defined and produce multiple candidate solutions. Our forward and inverse neural network architectures are suitable for asynchronous parallel computation.  ( 2 min )
    Broad Learning System with Takagi-Sugeno Fuzzy Subsystem for Tobacco Origin Identification based on Near Infrared Spectroscopy. (arXiv:2301.00126v1 [cs.LG])
    Tobacco origin identification is significantly important in tobacco industry. Modeling analysis for sensor data with near infrared spectroscopy has become a popular method for rapid detection of internal features. However, for sensor data analysis using traditional artificial neural network or deep network models, the training process is extremely time-consuming. In this paper, a novel broad learning system with Takagi-Sugeno (TS) fuzzy subsystem is proposed for rapid identification of tobacco origin. Incremental learning is employed in the proposed method, which obtains the weight matrix of the network after a very small amount of computation, resulting in much shorter training time for the model, with only about 3 seconds for the extra step training. The experimental results show that the TS fuzzy subsystem can extract features from the near infrared data and effectively improve the recognition performance. The proposed method can achieve the highest prediction accuracy (95.59 %) in comparison to the traditional classification algorithms, artificial neural network, and deep convolutional neural network, and has a great advantage in the training time with only about 128 seconds.  ( 2 min )
    Internet of Things: Digital Footprints Carry A Device Identity. (arXiv:2301.00328v1 [cs.LG])
    The usage of technologically advanced devices has seen a boom in many domains, including education, automation, and healthcare; with most of the services requiring Internet connectivity. To secure a network, device identification plays key role. In this paper, a device fingerprinting (DFP) model, which is able to distinguish between Internet of Things (IoT) and non-IoT devices, as well as uniquely identify individual devices, has been proposed. Four statistical features have been extracted from the consecutive five device-originated packets, to generate individual device fingerprints. The method has been evaluated using the Random Forest (RF) classifier and different datasets. Experimental results have shown that the proposed method achieves up to 99.8% accuracy in distinguishing between IoT and non-IoT devices and over 97.6% in classifying individual devices. These signify that the proposed method is useful in assisting operators in making their networks more secure and robust to security breaches and unauthorized access.  ( 2 min )
    Self-organization Preserved Graph Structure Learning with Principle of Relevant Information. (arXiv:2301.00015v1 [cs.LG])
    Most Graph Neural Networks follow the message-passing paradigm, assuming the observed structure depicts the ground-truth node relationships. However, this fundamental assumption cannot always be satisfied, as real-world graphs are always incomplete, noisy, or redundant. How to reveal the inherent graph structure in a unified way remains under-explored. We proposed PRI-GSL, a Graph Structure Learning framework guided by the Principle of Relevant Information, providing a simple and unified framework for identifying the self-organization and revealing the hidden structure. PRI-GSL learns a structure that contains the most relevant yet least redundant information quantified by von Neumann entropy and Quantum Jensen-Shannon divergence. PRI-GSL incorporates the evolution of quantum continuous walk with graph wavelets to encode node structural roles, showing in which way the nodes interplay and self-organize with the graph structure. Extensive experiments demonstrate the superior effectiveness and robustness of PRI-GSL.  ( 2 min )
    A Comparative Study of Image Disguising Methods for Confidential Outsourced Learning. (arXiv:2301.00252v1 [cs.CR])
    Large training data and expensive model tweaking are standard features of deep learning for images. As a result, data owners often utilize cloud resources to develop large-scale complex models, which raises privacy concerns. Existing solutions are either too expensive to be practical or do not sufficiently protect the confidentiality of data and models. In this paper, we study and compare novel \emph{image disguising} mechanisms, DisguisedNets and InstaHide, aiming to achieve a better trade-off among the level of protection for outsourced DNN model training, the expenses, and the utility of data. DisguisedNets are novel combinations of image blocktization, block-level random permutation, and two block-level secure transformations: random multidimensional projection (RMT) and AES pixel-level encryption (AES). InstaHide is an image mixup and random pixel flipping technique \cite{huang20}. We have analyzed and evaluated them under a multi-level threat model. RMT provides a better security guarantee than InstaHide, under the Level-1 adversarial knowledge with well-preserved model quality. In contrast, AES provides a security guarantee under the Level-2 adversarial knowledge, but it may affect model quality more. The unique features of image disguising also help us to protect models from model-targeted attacks. We have done an extensive experimental evaluation to understand how these methods work in different settings for different datasets.  ( 2 min )
    MTNeuro: A Benchmark for Evaluating Representations of Brain Structure Across Multiple Levels of Abstraction. (arXiv:2301.00345v1 [cs.CV])
    There are multiple scales of abstraction from which we can describe the same image, depending on whether we are focusing on fine-grained details or a more global attribute of the image. In brain mapping, learning to automatically parse images to build representations of both small-scale features (e.g., the presence of cells or blood vessels) and global properties of an image (e.g., which brain region the image comes from) is a crucial and open challenge. However, most existing datasets and benchmarks for neuroanatomy consider only a single downstream task at a time. To bridge this gap, we introduce a new dataset, annotations, and multiple downstream tasks that provide diverse ways to readout information about brain structure and architecture from the same image. Our multi-task neuroimaging benchmark (MTNeuro) is built on volumetric, micrometer-resolution X-ray microtomography images spanning a large thalamocortical section of mouse brain, encompassing multiple cortical and subcortical regions. We generated a number of different prediction challenges and evaluated several supervised and self-supervised models for brain-region prediction and pixel-level semantic segmentation of microstructures. Our experiments not only highlight the rich heterogeneity of this dataset, but also provide insights into how self-supervised approaches can be used to learn representations that capture multiple attributes of a single image and perform well on a variety of downstream tasks. Datasets, code, and pre-trained baseline models are provided at: https://mtneuro.github.io/ .  ( 2 min )
    Self-Activating Neural Ensembles for Continual Reinforcement Learning. (arXiv:2301.00141v1 [cs.LG])
    The ability for an agent to continuously learn new skills without catastrophically forgetting existing knowledge is of critical importance for the development of generally intelligent agents. Most methods devised to address this problem depend heavily on well-defined task boundaries, and thus depend on human supervision. Our task-agnostic method, Self-Activating Neural Ensembles (SANE), uses a modular architecture designed to avoid catastrophic forgetting without making any such assumptions. At the beginning of each trajectory, a module in the SANE ensemble is activated to determine the agent's next policy. During training, new modules are created as needed and only activated modules are updated to ensure that unused modules remain unchanged. This system enables our method to retain and leverage old skills, while growing and learning new ones. We demonstrate our approach on visually rich procedurally generated environments.  ( 2 min )
    Physics-informed Neural Networks approach to solve the Blasius function. (arXiv:2301.00106v1 [cs.LG])
    Deep learning techniques with neural networks have been used effectively in computational fluid dynamics (CFD) to obtain solutions to nonlinear differential equations. This paper presents a physics-informed neural network (PINN) approach to solve the Blasius function. This method eliminates the process of changing the non-linear differential equation to an initial value problem. Also, it tackles the convergence issue arising in the conventional series solution. It is seen that this method produces results that are at par with the numerical and conventional methods. The solution is extended to the negative axis to show that PINNs capture the singularity of the function at $\eta=-5.69$  ( 2 min )
    Efficient On-device Training via Gradient Filtering. (arXiv:2301.00330v1 [cs.CV])
    Despite its importance for federated learning, continuous learning and many other applications, on-device training remains an open problem for EdgeAI. The problem stems from the large number of operations (e.g., floating point multiplications and additions) and memory consumption required during training by the back-propagation algorithm. Consequently, in this paper, we propose a new gradient filtering approach which enables on-device DNN model training. More precisely, our approach creates a special structure with fewer unique elements in the gradient map, thus significantly reducing the computational complexity and memory consumption of back propagation during training. Extensive experiments on image classification and semantic segmentation with multiple DNN models (e.g., MobileNet, DeepLabV3, UPerNet) and devices (e.g., Raspberry Pi and Jetson Nano) demonstrate the effectiveness and wide applicability of our approach. For example, compared to SOTA, we achieve up to 19$\times$ speedup and 77.1% memory savings on ImageNet classification with only 0.1% accuracy loss. Finally, our method is easy to implement and deploy; over 20$\times$ speedup and 90% energy savings have been observed compared to highly optimized baselines in MKLDNN and CUDNN on NVIDIA Jetson Nano. Consequently, our approach opens up a new direction of research with a huge potential for on-device training.  ( 2 min )
    Exploring the Use of Data-Driven Approaches for Anomaly Detection in the Internet of Things (IoT) Environment. (arXiv:2301.00134v1 [cs.LG])
    The Internet of Things (IoT) is a system that connects physical computing devices, sensors, software, and other technologies. Data can be collected, transferred, and exchanged with other devices over the network without requiring human interactions. One challenge the development of IoT faces is the existence of anomaly data in the network. Therefore, research on anomaly detection in the IoT environment has become popular and necessary in recent years. This survey provides an overview to understand the current progress of the different anomaly detection algorithms and how they can be applied in the context of the Internet of Things. In this survey, we categorize the widely used anomaly detection machine learning and deep learning techniques in IoT into three types: clustering-based, classification-based, and deep learning based. For each category, we introduce some state-of-the-art anomaly detection methods and evaluate the advantages and limitations of each technique.  ( 2 min )
    New Challenges in Reinforcement Learning: A Survey of Security and Privacy. (arXiv:2301.00188v1 [cs.LG])
    Reinforcement learning (RL) is one of the most important branches of AI. Due to its capacity for self-adaption and decision-making in dynamic environments, reinforcement learning has been widely applied in multiple areas, such as healthcare, data markets, autonomous driving, and robotics. However, some of these applications and systems have been shown to be vulnerable to security or privacy attacks, resulting in unreliable or unstable services. A large number of studies have focused on these security and privacy problems in reinforcement learning. However, few surveys have provided a systematic review and comparison of existing problems and state-of-the-art solutions to keep up with the pace of emerging threats. Accordingly, we herein present such a comprehensive review to explain and summarize the challenges associated with security and privacy in reinforcement learning from a new perspective, namely that of the Markov Decision Process (MDP). In this survey, we first introduce the key concepts related to this area. Next, we cover the security and privacy issues linked to the state, action, environment, and reward function of the MDP process, respectively. We further highlight the special characteristics of security and privacy methodologies related to reinforcement learning. Finally, we discuss the possible future research directions within this area.  ( 2 min )
    Lightmorphic Signatures Analysis Toolkit. (arXiv:2301.00281v1 [cs.LG])
    In this paper we discuss the theory used in the design of an open source lightmorphic signatures analysis toolkit (LSAT). In addition to providing a core functionality, the software package enables specific optimizations with its modular and customizable design. To promote its usage and inspire future contributions, LSAT is publicly available. By using a self-supervised neural network and augmented machine learning algorithms, LSAT provides an easy-to-use interface with ample documentation. The experiments demonstrate that LSAT improves the otherwise tedious and error-prone tasks of translating lightmorphic associated data into usable spectrograms, enhanced with parameter tuning and performance analysis. With the provided mathematical functions, LSAT validates the nonlinearity encountered in the data conversion process while ensuring suitability of the forecasting algorithms.  ( 2 min )
    An Adaptive Kernel Approach to Federated Learning of Heterogeneous Causal Effects. (arXiv:2301.00346v1 [cs.LG])
    We propose a new causal inference framework to learn causal effects from multiple, decentralized data sources in a federated setting. We introduce an adaptive transfer algorithm that learns the similarities among the data sources by utilizing Random Fourier Features to disentangle the loss function into multiple components, each of which is associated with a data source. The data sources may have different distributions; the causal effects are independently and systematically incorporated. The proposed method estimates the similarities among the sources through transfer coefficients, and hence requiring no prior information about the similarity measures. The heterogeneous causal effects can be estimated with no sharing of the raw training data among the sources, thus minimizing the risk of privacy leak. We also provide minimax lower bounds to assess the quality of the parameters learned from the disparate sources. The proposed method is empirically shown to outperform the baselines on decentralized data sources with dissimilar distributions.  ( 2 min )
    Mapping Knowledge Representations to Concepts: A Review and New Perspectives. (arXiv:2301.00189v1 [cs.AI])
    The success of neural networks builds to a large extent on their ability to create internal knowledge representations from real-world high-dimensional data, such as images, sound, or text. Approaches to extract and present these representations, in order to explain the neural network's decisions, is an active and multifaceted research field. To gain a deeper understanding of a central aspect of this field, we have performed a targeted review focusing on research that aims to associate internal representations with human understandable concepts. In doing this, we added a perspective on the existing research by using primarily deductive nomological explanations as a proposed taxonomy. We find this taxonomy and theories of causality, useful for understanding what can be expected, and not expected, from neural network explanations. The analysis additionally uncovers an ambiguity in the reviewed literature related to the goal of model explainability; is it understanding the ML model or, is it actionable explanations useful in the deployment domain?  ( 2 min )
    Approaching Peak Ground Truth. (arXiv:2301.00243v1 [cs.LG])
    Machine learning models are typically evaluated by computing similarity with reference annotations and trained by maximizing similarity with such. Especially in the bio-medical domain, annotations are subjective and suffer from low inter- and intra-rater reliability. Since annotations only reflect the annotation entity's interpretation of the real world, this can lead to sub-optimal predictions even though the model achieves high similarity scores. Here, the theoretical concept of Peak Ground Truth (PGT) is introduced. PGT marks the point beyond which an increase in similarity with the reference annotation stops translating to better Real World Model Performance (RWMP). Additionally, a quantitative technique to approximate PGT by computing inter- and intra-rater reliability is proposed. Finally, three categories of PGT-aware strategies to evaluate and improve model performance are reviewed.  ( 2 min )
    Computational Charisma -- A Brick by Brick Blueprint for Building Charismatic Artificial Intelligence. (arXiv:2301.00142v1 [cs.HC])
    Charisma is considered as one's ability to attract and potentially also influence others. Clearly, there can be considerable interest from an artificial intelligence's (AI) perspective to provide it with such skill. Beyond, a plethora of use cases opens up for computational measurement of human charisma, such as for tutoring humans in the acquisition of charisma, mediating human-to-human conversation, or identifying charismatic individuals in big social data. A number of models exist that base charisma on various dimensions, often following the idea that charisma is given if someone could and would help others. Examples include influence (could help) and affability (would help) in scientific studies or power (could help), presence, and warmth (both would help) as a popular concept. Modelling high levels in these dimensions for humanoid robots or virtual agents, seems accomplishable. Beyond, also automatic measurement appears quite feasible with the recent advances in the related fields of Affective Computing and Social Signal Processing. Here, we, thereforem present a blueprint for building machines that can appear charismatic, but also analyse the charisma of others. To this end, we first provide the psychological perspective including different models of charisma and behavioural cues of it. We then switch to conversational charisma in spoken language as an exemplary modality that is essential for human-human and human-computer conversations. The computational perspective then deals with the recognition and generation of charismatic behaviour by AI. This includes an overview of the state of play in the field and the aforementioned blueprint. We then name exemplary use cases of computational charismatic skills before switching to ethical aspects and concluding this overview and perspective on building charisma-enabled AI.  ( 3 min )
    Learning from Guided Play: Improving Exploration for Adversarial Imitation Learning with Simple Auxiliary Tasks. (arXiv:2301.00051v1 [cs.LG])
    Adversarial imitation learning (AIL) has become a popular alternative to supervised imitation learning that reduces the distribution shift suffered by the latter. However, AIL requires effective exploration during an online reinforcement learning phase. In this work, we show that the standard, naive approach to exploration can manifest as a suboptimal local maximum if a policy learned with AIL sufficiently matches the expert distribution without fully learning the desired task. This can be particularly catastrophic for manipulation tasks, where the difference between an expert and a non-expert state-action pair is often subtle. We present Learning from Guided Play (LfGP), a framework in which we leverage expert demonstrations of multiple exploratory, auxiliary tasks in addition to a main task. The addition of these auxiliary tasks forces the agent to explore states and actions that standard AIL may learn to ignore. Additionally, this particular formulation allows for the reusability of expert data between main tasks. Our experimental results in a challenging multitask robotic manipulation domain indicate that LfGP significantly outperforms both AIL and behaviour cloning, while also being more expert sample efficient than these baselines. To explain this performance gap, we provide further analysis of a toy problem that highlights the coupling between a local maximum and poor exploration, and also visualize the differences between the learned models from AIL and LfGP.  ( 2 min )
    Adapting Node-Place Model to Predict and Monitor COVID-19 Footprints and Transmission Risks. (arXiv:2301.00117v1 [physics.soc-ph])
    The node-place model has been widely used to classify and evaluate transit stations, which sheds light on individual travel behaviors and supports urban planning through effectively integrating land use and transportation development. This article adapts this model to investigate whether and how node, place, and mobility would be associated with the transmission risks and presences of the local COVID-19 cases in a city. Similar studies on the model and its relevance to COVID-19, according to our knowledge, have not been undertaken before. Moreover, the unique metric drawn from detailed visit history of the infected, i.e., the COVID-19 footprints, is proposed and exploited. This study then empirically uses the adapted model to examine the station-level factors affecting the local COVID-19 footprints. The model accounts for traditional measures of the node and place as well as actual human mobility patterns associated with the node and place. It finds that stations with high node, place, and human mobility indices normally have more COVID-19 footprints in proximity. A multivariate regression is fitted to see whether and to what degree different indices and indicators can predict the COVID-19 footprints. The results indicate that many of the place, node, and human mobility indicators significantly impact the concentration of COVID-19 footprints. These are useful for policy-makers to predict and monitor hotspots for COVID-19 and other pandemics transmission.  ( 2 min )
    Towards Proactively Forecasting Sentence-Specific Information Popularity within Online News Documents. (arXiv:2301.00152v1 [cs.CL])
    Multiple studies have focused on predicting the prospective popularity of an online document as a whole, without paying attention to the contributions of its individual parts. We introduce the task of proactively forecasting popularities of sentences within online news documents solely utilizing their natural language content. We model sentence-specific popularity forecasting as a sequence regression task. For training our models, we curate InfoPop, the first dataset containing popularity labels for over 1.7 million sentences from over 50,000 online news documents. To the best of our knowledge, this is the first dataset automatically created using streams of incoming search engine queries to generate sentence-level popularity annotations. We propose a novel transfer learning approach involving sentence salience prediction as an auxiliary task. Our proposed technique coupled with a BERT-based neural model exceeds nDCG values of 0.8 for proactive sentence-specific popularity forecasting. Notably, our study presents a non-trivial takeaway: though popularity and salience are different concepts, transfer learning from salience prediction enhances popularity forecasting. We release InfoPop and make our code publicly available: https://github.com/sayarghoshroy/InfoPopularity  ( 2 min )
    Inference on Time Series Nonparametric Conditional Moment Restrictions Using General Sieves. (arXiv:2301.00092v1 [stat.ML])
    General nonlinear sieve learnings are classes of nonlinear sieves that can approximate nonlinear functions of high dimensional variables much more flexibly than various linear sieves (or series). This paper considers general nonlinear sieve quasi-likelihood ratio (GN-QLR) based inference on expectation functionals of time series data, where the functionals of interest are based on some nonparametric function that satisfy conditional moment restrictions and are learned using multilayer neural networks. While the asymptotic normality of the estimated functionals depends on some unknown Riesz representer of the functional space, we show that the optimally weighted GN-QLR statistic is asymptotically Chi-square distributed, regardless whether the expectation functional is regular (root-$n$ estimable) or not. This holds when the data are weakly dependent beta-mixing condition. We apply our method to the off-policy evaluation in reinforcement learning, by formulating the Bellman equation into the conditional moment restriction framework, so that we can make inference about the state-specific value functional using the proposed GN-QLR method with time series data. In addition, estimating the averaged partial means and averaged partial derivatives of nonparametric instrumental variables and quantile IV models are also presented as leading examples. Finally, a Monte Carlo study shows the finite sample performance of the procedure  ( 2 min )
    Intrinsic Motivation in Dynamical Control Systems. (arXiv:2301.00005v1 [cs.LG])
    Biological systems often choose actions without an explicit reward signal, a phenomenon known as intrinsic motivation. The computational principles underlying this behavior remain poorly understood. In this study, we investigate an information-theoretic approach to intrinsic motivation, based on maximizing an agent's empowerment (the mutual information between its past actions and future states). We show that this approach generalizes previous attempts to formalize intrinsic motivation, and we provide a computationally efficient algorithm for computing the necessary quantities. We test our approach on several benchmark control problems, and we explain its success in guiding intrinsically motivated behaviors by relating our information-theoretic control function to fundamental properties of the dynamical system representing the combined agent-environment system. This opens the door for designing practical artificial, intrinsically motivated controllers and for linking animal behaviors to their dynamical properties.  ( 2 min )
    Quantum Machine Learning Applied to the Classification of Diabetes. (arXiv:2301.00109v1 [cs.LG])
    Quantum Machine Learning (QML) shows how it maintains certain significant advantages over machine learning methods. It now shows that hybrid quantum methods have great scope for deployment and optimisation, and hold promise for future industries. As a weakness, quantum computing does not have enough qubits to justify its potential. This topic of study gives us encouraging results in the improvement of quantum coding, being the data preprocessing an important point in this research we employ two dimensionality reduction techniques LDA and PCA applying them in a hybrid way Quantum Support Vector Classifier (QSVC) and Variational Quantum Classifier (VQC) in the classification of Diabetes.  ( 2 min )
    Bayesian Learning for Dynamic Inference. (arXiv:2301.00032v1 [cs.LG])
    The traditional statistical inference is static, in the sense that the estimate of the quantity of interest does not affect the future evolution of the quantity. In some sequential estimation problems however, the future values of the quantity to be estimated depend on the estimate of its current value. This type of estimation problems has been formulated as the dynamic inference problem. In this work, we formulate the Bayesian learning problem for dynamic inference, where the unknown quantity-generation model is assumed to be randomly drawn according to a random model parameter. We derive the optimal Bayesian learning rules, both offline and online, to minimize the inference loss. Moreover, learning for dynamic inference can serve as a meta problem, such that all familiar machine learning problems, including supervised learning, imitation learning and reinforcement learning, can be cast as its special cases or variants. Gaining a good understanding of this unifying meta problem thus sheds light on a broad spectrum of machine learning problems as well.  ( 2 min )
    On the Geometry of Reinforcement Learning in Continuous State and Action Spaces. (arXiv:2301.00009v1 [cs.LG])
    Advances in reinforcement learning have led to its successful application in complex tasks with continuous state and action spaces. Despite these advances in practice, most theoretical work pertains to finite state and action spaces. We propose building a theoretical understanding of continuous state and action spaces by employing a geometric lens. Central to our work is the idea that the transition dynamics induce a low dimensional manifold of reachable states embedded in the high-dimensional nominal state space. We prove that, under certain conditions, the dimensionality of this manifold is at most the dimensionality of the action space plus one. This is the first result of its kind, linking the geometry of the state space to the dimensionality of the action space. We empirically corroborate this upper bound for four MuJoCo environments. We further demonstrate the applicability of our result by learning a policy in this low dimensional representation. To do so we introduce an algorithm that learns a mapping to a low dimensional representation, as a narrow hidden layer of a deep neural network, in tandem with the policy using DDPG. Our experiments show that a policy learnt this way perform on par or better for four MuJoCo control suite tasks.  ( 2 min )
    On High dimensional Poisson models with measurement error: hypothesis testing for nonlinear nonconvex optimization. (arXiv:2301.00139v1 [math.ST])
    We study estimation and testing in the Poisson regression model with noisy high dimensional covariates, which has wide applications in analyzing noisy big data. Correcting for the estimation bias due to the covariate noise leads to a non-convex target function to minimize. Treating the high dimensional issue further leads us to augment an amenable penalty term to the target function. We propose to estimate the regression parameter through minimizing the penalized target function. We derive the L1 and L2 convergence rates of the estimator and prove the variable selection consistency. We further establish the asymptotic normality of any subset of the parameters, where the subset can have infinitely many components as long as its cardinality grows sufficiently slow. We develop Wald and score tests based on the asymptotic normality of the estimator, which permits testing of linear functions of the members if the subset. We examine the finite sample performance of the proposed tests by extensive simulation. Finally, the proposed method is successfully applied to the Alzheimer's Disease Neuroimaging Initiative study, which motivated this work initially.  ( 2 min )
    Contextual Bandits and Optimistically Universal Learning. (arXiv:2301.00241v1 [stat.ML])
    We consider the contextual bandit problem on general action and context spaces, where the learner's rewards depend on their selected actions and an observable context. This generalizes the standard multi-armed bandit to the case where side information is available, e.g., patients' records or customers' history, which allows for personalized treatment. We focus on consistency -- vanishing regret compared to the optimal policy -- and show that for large classes of non-i.i.d. contexts, consistency can be achieved regardless of the time-invariant reward mechanism, a property known as universal consistency. Precisely, we first give necessary and sufficient conditions on the context-generating process for universal consistency to be possible. Second, we show that there always exists an algorithm that guarantees universal consistency whenever this is achievable, called an optimistically universal learning rule. Interestingly, for finite action spaces, learnable processes for universal learning are exactly the same as in the full-feedback setting of supervised learning, previously studied in the literature. In other words, learning can be performed with partial feedback without any generalization cost. The algorithms balance a trade-off between generalization (similar to structural risk minimization) and personalization (tailoring actions to specific contexts). Lastly, we consider the case of added continuity assumptions on rewards and show that these lead to universal consistency for significantly larger classes of data-generating processes.  ( 2 min )
    Source-Free Unsupervised Domain Adaptation: A Survey. (arXiv:2301.00265v1 [cs.CV])
    Unsupervised domain adaptation (UDA) via deep learning has attracted appealing attention for tackling domain-shift problems caused by distribution discrepancy across different domains. Existing UDA approaches highly depend on the accessibility of source domain data, which is usually limited in practical scenarios due to privacy protection, data storage and transmission cost, and computation burden. To tackle this issue, many source-free unsupervised domain adaptation (SFUDA) methods have been proposed recently, which perform knowledge transfer from a pre-trained source model to unlabeled target domain with source data inaccessible. A comprehensive review of these works on SFUDA is of great significance. In this paper, we provide a timely and systematic literature review of existing SFUDA approaches from a technical perspective. Specifically, we categorize current SFUDA studies into two groups, i.e., white-box SFUDA and black-box SFUDA, and further divide them into finer subcategories based on different learning strategies they use. We also investigate the challenges of methods in each subcategory, discuss the advantages/disadvantages of white-box and black-box SFUDA methods, conclude the commonly used benchmark datasets, and summarize the popular techniques for improved generalizability of models learned without using source data. We finally discuss several promising future directions in this field.  ( 2 min )
    Modified Query Expansion Through Generative Adversarial Networks for Information Extraction in E-Commerce. (arXiv:2301.00036v1 [cs.LG])
    This work addresses an alternative approach for query expansion (QE) using a generative adversarial network (GAN) to enhance the effectiveness of information search in e-commerce. We propose a modified QE conditional GAN (mQE-CGAN) framework, which resolves keywords by expanding the query with a synthetically generated query that proposes semantic information from text input. We train a sequence-to-sequence transformer model as the generator to produce keywords and use a recurrent neural network model as the discriminator to classify an adversarial output with the generator. With the modified CGAN framework, various forms of semantic insights gathered from the query document corpus are introduced to the generation process. We leverage these insights as conditions for the generator model and discuss their effectiveness for the query expansion task. Our experiments demonstrate that the utilization of condition structures within the mQE-CGAN framework can increase the semantic similarity between generated sequences and reference documents up to nearly 10% compared to baseline models  ( 2 min )
    Time series Forecasting to detect anomalous behaviours in Multiphase Flow Meters. (arXiv:2301.00014v1 [cs.LG])
    An Anomaly Detection (AD) System for Self-diagnosis has been developed for Multiphase Flow Meter (MPFM). The system relies on machine learning algorithms for time series forecasting, historical data have been used to train a model and to predict the behavior of a sensor and, thus, to detect anomalies.  ( 2 min )
    Hair and Scalp Disease Detection using Machine Learning and Image Processing. (arXiv:2301.00122v1 [cs.CV])
    Almost 80 million Americans suffer from hair loss due to aging, stress, medication, or genetic makeup. Hair and scalp-related diseases often go unnoticed in the beginning. Sometimes, a patient cannot differentiate between hair loss and regular hair fall. Diagnosing hair-related diseases is time-consuming as it requires professional dermatologists to perform visual and medical tests. Because of that, the overall diagnosis gets delayed, which worsens the severity of the illness. Due to the image-processing ability, neural network-based applications are used in various sectors, especially healthcare and health informatics, to predict deadly diseases like cancers and tumors. These applications assist clinicians and patients and provide an initial insight into early-stage symptoms. In this study, we used a deep learning approach that successfully predicts three main types of hair loss and scalp-related diseases: alopecia, psoriasis, and folliculitis. However, limited study in this area, unavailability of a proper dataset, and degree of variety among the images scattered over the internet made the task challenging. 150 images were obtained from various sources and then preprocessed by denoising, image equalization, enhancement, and data balancing, thereby minimizing the error rate. After feeding the processed data into the 2D convolutional neural network (CNN) model, we obtained overall training accuracy of 96.2%, with a validation accuracy of 91.1%. The precision and recall score of alopecia, psoriasis, and folliculitis are 0.895, 0.846, and 1.0, respectively. We also created a dataset of the scalp images for future prospective researchers.  ( 2 min )
    Effects of Data Geometry in Early Deep Learning. (arXiv:2301.00008v1 [cs.LG])
    Deep neural networks can approximate functions on different types of data, from images to graphs, with varied underlying structure. This underlying structure can be viewed as the geometry of the data manifold. By extending recent advances in the theoretical understanding of neural networks, we study how a randomly initialized neural network with piece-wise linear activation splits the data manifold into regions where the neural network behaves as a linear function. We derive bounds on the density of boundary of linear regions and the distance to these boundaries on the data manifold. This leads to insights into the expressivity of randomly initialized deep neural networks on non-Euclidean data sets. We empirically corroborate our theoretical results using a toy supervised learning problem. Our experiments demonstrate that number of linear regions varies across manifolds and the results hold with changing neural network architectures. We further demonstrate how the complexity of linear regions is different on the low dimensional manifold of images as compared to the Euclidean space, using the MetFaces dataset.  ( 2 min )
    Accuracy-Guaranteed Collaborative DNN Inference in Industrial IoT via Deep Reinforcement Learning. (arXiv:2301.00130v1 [eess.SY])
    Collaboration among industrial Internet of Things (IoT) devices and edge networks is essential to support computation-intensive deep neural network (DNN) inference services which require low delay and high accuracy. Sampling rate adaption which dynamically configures the sampling rates of industrial IoT devices according to network conditions, is the key in minimizing the service delay. In this paper, we investigate the collaborative DNN inference problem in industrial IoT networks. To capture the channel variation and task arrival randomness, we formulate the problem as a constrained Markov decision process (CMDP). Specifically, sampling rate adaption, inference task offloading and edge computing resource allocation are jointly considered to minimize the average service delay while guaranteeing the long-term accuracy requirements of different inference services. Since CMDP cannot be directly solved by general reinforcement learning (RL) algorithms due to the intractable long-term constraints, we first transform the CMDP into an MDP by leveraging the Lyapunov optimization technique. Then, a deep RL-based algorithm is proposed to solve the MDP. To expedite the training process, an optimization subroutine is embedded in the proposed algorithm to directly obtain the optimal edge computing resource allocation. Extensive simulation results are provided to demonstrate that the proposed RL-based algorithm can significantly reduce the average service delay while preserving long-term inference accuracy with a high probability.  ( 2 min )
    eVAE: Evolutionary Variational Autoencoder. (arXiv:2301.00011v1 [cs.NE])
    The surrogate loss of variational autoencoders (VAEs) poses various challenges to their training, inducing the imbalance between task fitting and representation inference. To avert this, the existing strategies for VAEs focus on adjusting the tradeoff by introducing hyperparameters, deriving a tighter bound under some mild assumptions, or decomposing the loss components per certain neural settings. VAEs still suffer from uncertain tradeoff learning.We propose a novel evolutionary variational autoencoder (eVAE) building on the variational information bottleneck (VIB) theory and integrative evolutionary neural learning. eVAE integrates a variational genetic algorithm into VAE with variational evolutionary operators including variational mutation, crossover, and evolution. Its inner-outer-joint training mechanism synergistically and dynamically generates and updates the uncertain tradeoff learning in the evidence lower bound (ELBO) without additional constraints. Apart from learning a lossy compression and representation of data under the VIB assumption, eVAE presents an evolutionary paradigm to tune critical factors of VAEs and deep neural networks and addresses the premature convergence and random search problem by integrating evolutionary optimization into deep learning. Experiments show that eVAE addresses the KL-vanishing problem for text generation with low reconstruction loss, generates all disentangled factors with sharp images, and improves the image generation quality,respectively. eVAE achieves better reconstruction loss, disentanglement, and generation-inference balance than its competitors.  ( 2 min )
    Behave-XAI: Deep Explainable Learning of Behavioral Representational Data. (arXiv:2301.00016v1 [cs.LG])
    According to the latest trend of artificial intelligence, AI-systems needs to clarify regarding general,specific decisions,services provided by it. Only consumer is satisfied, with explanation , for example, why any classification result is the outcome of any given time. This actually motivates us using explainable or human understandable AI for a behavioral mining scenario, where users engagement on digital platform is determined from context, such as emotion, activity, weather, etc. However, the output of AI-system is not always systematically correct, and often systematically correct, but apparently not-perfect and thereby creating confusions, such as, why the decision is given? What is the reason underneath? In this context, we first formulate the behavioral mining problem in deep convolutional neural network architecture. Eventually, we apply a recursive neural network due to the presence of time-series data from users physiological and environmental sensor-readings. Once the model is developed, explanations are presented with the advent of XAI models in front of users. This critical step involves extensive trial with users preference on explanations over conventional AI, judgement of credibility of explanation.  ( 2 min )
    Selected aspects of complex, hypercomplex and fuzzy neural networks. (arXiv:2301.00007v1 [cs.LG])
    This short report reviews the current state of the research and methodology on theoretical and practical aspects of Artificial Neural Networks (ANN). It was prepared to gather state-of-the-art knowledge needed to construct complex, hypercomplex and fuzzy neural networks. The report reflects the individual interests of the authors and, by now means, cannot be treated as a comprehensive review of the ANN discipline. Considering the fast development of this field, it is currently impossible to do a detailed review of a considerable number of pages. The report is an outcome of the Project 'The Strategic Research Partnership for the mathematical aspects of complex, hypercomplex and fuzzy neural networks' meeting at the University of Warmia and Mazury in Olsztyn, Poland, organized in September 2022.  ( 2 min )
  • Open

    Online Linearized LASSO. (arXiv:2211.06039v2 [stat.ML] UPDATED)
    Sparse regression has been a popular approach to perform variable selection and enhance the prediction accuracy and interpretability of the resulting statistical model. Existing approaches focus on offline regularized regression, while the online scenario has rarely been studied. In this paper, we propose a novel online sparse linear regression framework for analyzing streaming data when data points arrive sequentially. Our proposed method is memory efficient and requires less stringent restricted strong convexity assumptions. Theoretically, we show that with a properly chosen regularization parameter, the $\ell_2$-norm statistical error of our estimator diminishes to zero in the optimal order of $\tilde{O}({\sqrt{s/t}})$, where $s$ is the sparsity level, $t$ is the streaming sample size, and $\tilde{O}(\cdot)$ hides logarithmic terms. Numerical experiments demonstrate the practical efficiency of our algorithm.  ( 2 min )
    Predicting Cellular Responses to Novel Drug Perturbations at a Single-Cell Resolution. (arXiv:2204.13545v2 [cs.LG] UPDATED)
    Single-cell transcriptomics enabled the study of cellular heterogeneity in response to perturbations at the resolution of individual cells. However, scaling high-throughput screens (HTSs) to measure cellular responses for many drugs remains a challenge due to technical limitations and, more importantly, the cost of such multiplexed experiments. Thus, transferring information from routinely performed bulk RNA HTS is required to enrich single-cell data meaningfully. We introduce chemCPA, a new encoder-decoder architecture to study the perturbational effects of unseen drugs. We combine the model with an architecture surgery for transfer learning and demonstrate how training on existing bulk RNA HTS datasets can improve generalisation performance. Better generalisation reduces the need for extensive and costly screens at single-cell resolution. We envision that our proposed method will facilitate more efficient experiment designs through its ability to generate in-silico hypotheses, ultimately accelerating drug discovery.  ( 2 min )
    Causal Inference (C-inf) -- closed form worst case typical phase transitions. (arXiv:2301.00793v1 [stat.ML])
    In this paper we establish a mathematically rigorous connection between Causal inference (C-inf) and the low-rank recovery (LRR). Using Random Duality Theory (RDT) concepts developed in [46,48,50] and novel mathematical strategies related to free probability theory, we obtain the exact explicit typical (and achievable) worst case phase transitions (PT). These PT precisely separate scenarios where causal inference via LRR is possible from those where it is not. We supplement our mathematical analysis with numerical experiments that confirm the theoretical predictions of PT phenomena, and further show that the two closely match for fairly small sample sizes. We obtain simple closed form representations for the resulting PTs, which highlight direct relations between the low rankness of the target C-inf matrix and the time of the treatment. Hence, our results can be used to determine the range of C-inf's typical applicability.  ( 2 min )
    Dimensionless machine learning: Imposing exact units equivariance. (arXiv:2204.00887v2 [stat.ML] UPDATED)
    Units equivariance (or units covariance) is the exact symmetry that follows from the requirement that relationships among measured quantities of physics relevance must obey self-consistent dimensional scalings. Here, we express this symmetry in terms of a (non-compact) group action, and we employ dimensional analysis and ideas from equivariant machine learning to provide a methodology for exactly units-equivariant machine learning: For any given learning task, we first construct a dimensionless version of its inputs using classic results from dimensional analysis, and then perform inference in the dimensionless space. Our approach can be used to impose units equivariance across a broad range of machine learning methods which are equivariant to rotations and other groups. We discuss the in-sample and out-of-sample prediction accuracy gains one can obtain in contexts like symbolic regression and emulation, where symmetry is important. We illustrate our approach with simple numerical examples involving dynamical systems in physics and ecology.  ( 2 min )
    Posterior Collapse and Latent Variable Non-identifiability. (arXiv:2301.00537v1 [stat.ML])
    Variational autoencoders model high-dimensional data by positing low-dimensional latent variables that are mapped through a flexible distribution parametrized by a neural network. Unfortunately, variational autoencoders often suffer from posterior collapse: the posterior of the latent variables is equal to its prior, rendering the variational autoencoder useless as a means to produce meaningful representations. Existing approaches to posterior collapse often attribute it to the use of neural networks or optimization issues due to variational approximation. In this paper, we consider posterior collapse as a problem of latent variable non-identifiability. We prove that the posterior collapses if and only if the latent variables are non-identifiable in the generative model. This fact implies that posterior collapse is not a phenomenon specific to the use of flexible distributions or approximate inference. Rather, it can occur in classical probabilistic models even with exact inference, which we also demonstrate. Based on these results, we propose a class of latent-identifiable variational autoencoders, deep generative models which enforce identifiability without sacrificing flexibility. This model class resolves the problem of latent variable non-identifiability by leveraging bijective Brenier maps and parameterizing them with input convex neural networks, without special variational inference objectives or optimization tricks. Across synthetic and real datasets, latent-identifiable variational autoencoders outperform existing methods in mitigating posterior collapse and providing meaningful representations of the data.  ( 2 min )
    Preface: Characterisation of Physical Processes from Anomalous Diffusion Data. (arXiv:2301.00800v1 [cond-mat.stat-mech])
    Preface to the special issue "Characterisation of Physical Processes from Anomalous Diffusion Data" associated with the Anomalous Diffusion Challenge ( https://andi-challenge.org ) and published in Journal of Physics A: Mathematical and Theoretical. The list of articles included in the special issue can be accessed at https://iopscience.iop.org/journal/1751-8121/page/Characterisation-of-Physical-Processes-from-Anomalous-Diffusion-Data .  ( 2 min )
    Optimality Guarantees for Particle Belief Approximation of POMDPs. (arXiv:2210.05015v2 [cs.AI] UPDATED)
    Partially observable Markov decision processes (POMDPs) provide a flexible representation for real-world decision and control problems. However, POMDPs are notoriously difficult to solve, especially when the state and observation spaces are continuous or hybrid, which is often the case for physical systems. While recent online sampling-based POMDP algorithms that plan with observation likelihood weighting have shown practical effectiveness, a general theory characterizing the approximation error of the particle filtering techniques that these algorithms use has not previously been proposed. Our main contribution is bounding the error between any POMDP and its corresponding finite sample particle belief MDP (PB-MDP) approximation. This fundamental bridge between PB-MDPs and POMDPs allows us to adapt any sampling-based MDP algorithm to a POMDP by solving the corresponding particle belief MDP, thereby extending the convergence guarantees of the MDP algorithm to the POMDP. Practically, this is implemented by using the particle filter belief transition model as the generative model for the MDP solver. While this requires access to the observation density model from the POMDP, it only increases the transition sampling complexity of the MDP solver by a factor of $\mathcal{O}(C)$, where $C$ is the number of particles. Thus, when combined with sparse sampling MDP algorithms, this approach can yield algorithms for POMDPs that have no direct theoretical dependence on the size of the state and observation spaces. In addition to our theoretical contribution, we perform five numerical experiments on benchmark POMDPs to demonstrate that a simple MDP algorithm adapted using PB-MDP approximation, Sparse-PFT, achieves performance competitive with other leading continuous observation POMDP solvers.  ( 2 min )
    Optimal Experimental Design for Staggered Rollouts. (arXiv:1911.03764v4 [econ.EM] UPDATED)
    In this paper, we study the design and analysis of experiments conducted on a set of units over multiple time periods where the starting time of the treatment may vary by unit. The design problem involves selecting an initial treatment time for each unit in order to most precisely estimate both the instantaneous and cumulative effects of the treatment. We first consider non-adaptive experiments, where all treatment assignment decisions are made prior to the start of the experiment. For this case, we show that the optimization problem is generally NP-hard and we propose a near-optimal solution. Under this solution the fraction entering treatment each period is initially low, then high, and finally low again. Next, we study an adaptive experimental design problem, where both the decision to continue the experiment and treatment assignment decisions are updated after each period's data is collected. For the adaptive case we propose a new algorithm, the Precision-Guided Adaptive Experiment (PGAE) algorithm, that addresses the challenges at both the design stage and at the stage of estimating treatment effects, ensuring valid post-experiment inference accounting for the adaptive nature of the design. Using realistic settings, we demonstrate that our proposed solutions can reduce the opportunity cost of the experiments by over 50\%, compared to static design benchmarks.  ( 2 min )
    Projection Robust Wasserstein Distance and Riemannian Optimization. (arXiv:2006.07458v10 [cs.LG] UPDATED)
    Projection robust Wasserstein (PRW) distance, or Wasserstein projection pursuit (WPP), is a robust variant of the Wasserstein distance. Recent work suggests that this quantity is more robust than the standard Wasserstein distance, in particular when comparing probability measures in high-dimensions. However, it is ruled out for practical application because the optimization model is essentially non-convex and non-smooth which makes the computation intractable. Our contribution in this paper is to revisit the original motivation behind WPP/PRW, but take the hard route of showing that, despite its non-convexity and lack of nonsmoothness, and even despite some hardness results proved by~\citet{Niles-2019-Estimation} in a minimax sense, the original formulation for PRW/WPP \textit{can} be efficiently computed in practice using Riemannian optimization, yielding in relevant cases better behavior than its convex relaxation. More specifically, we provide three simple algorithms with solid theoretical guarantee on their complexity bound (one in the appendix), and demonstrate their effectiveness and efficiency by conducing extensive experiments on synthetic and real data. This paper provides a first step into a computational theory of the PRW distance and provides the links between optimal transport and Riemannian optimization.  ( 3 min )
    Learning and interpreting asymmetry-labeled DAGs: a case study on COVID-19 fear. (arXiv:2301.00629v1 [cs.AI])
    Bayesian networks are widely used to learn and reason about the dependence structure of discrete variables. However, they are only capable of formally encoding symmetric conditional independence, which in practice is often too strict to hold. Asymmetry-labeled DAGs have been recently proposed to both extend the class of Bayesian networks by relaxing the symmetric assumption of independence and denote the type of dependence existing between the variables of interest. Here, we introduce novel structural learning algorithms for this class of models which, whilst being efficient, allow for a straightforward interpretation of the underlying dependence structure. A comprehensive computational study highlights the efficiency of the algorithms. A real-world data application using data from the Fear of COVID-19 Scale collected in Italy showcases their use in practice.  ( 2 min )
    Distribution Embedding Networks for Generalization from a Diverse Set of Classification Tasks. (arXiv:2202.01940v2 [stat.ML] UPDATED)
    We propose Distribution Embedding Networks (DEN) for classification with small data. In the same spirit of meta-learning, DEN learns from a diverse set of training tasks with the goal to generalize to unseen target tasks. Unlike existing approaches which require the inputs of training and target tasks to have the same dimension with possibly similar distributions, DEN allows training and target tasks to live in heterogeneous input spaces. This is especially useful for tabular-data tasks where labeled data from related tasks are scarce. DEN uses a three-block architecture: a covariate transformation block followed by a distribution embedding block and then a classification block. We provide theoretical insights to show that this architecture allows the embedding and classification blocks to be fixed after pre-training on a diverse set of tasks; only the covariate transformation block with relatively few parameters needs to be fine-tuned for each new task. To facilitate training, we also propose an approach to synthesize binary classification tasks, and demonstrate that DEN outperforms existing methods in a number of synthetic and real tasks in numerical studies.  ( 2 min )
    Graph Construction from Data using Non Negative Kernel regression (NNK Graphs). (arXiv:1910.09383v2 [cs.LG] UPDATED)
    Data-driven neighborhood definitions and graph constructions are often used in machine learning and signal processing applications. k-nearest neighbor~(kNN) and $\epsilon$-neighborhood methods are among the most common methods used for neighborhood selection, due to their computational simplicity. However, the choice of parameters associated with these methods, such as k and $\epsilon$, is still ad hoc. We make two main contributions in this paper. First, we present an alternative view of neighborhood selection, where we show that neighborhood construction is equivalent to a sparse signal approximation problem. Second, we propose an algorithm, non-negative kernel regression~(NNK), for obtaining neighborhoods that lead to better sparse representation. NNK draws similarities to the orthogonal matching pursuit approach to signal representation and possesses desirable geometric and theoretical properties. Experiments demonstrate (i) the robustness of the NNK algorithm for neighborhood and graph construction, (ii) its ability to adapt the number of neighbors to the data properties, and (iii) its superior performance in local neighborhood and graph-based machine learning tasks.  ( 2 min )
    ReSQueing Parallel and Private Stochastic Convex Optimization. (arXiv:2301.00457v1 [math.OC])
    We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Stochastic Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density. Combining ReSQue with recent advances in ball oracle acceleration [CJJJLST20, ACJJS21], we develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings. For a SCO objective constrained to the unit ball in $\mathbb{R}^d$, we obtain the following results (up to polylogarithmic factors). We give a parallel algorithm obtaining optimization error $\epsilon_{\text{opt}}$ with $d^{1/3}\epsilon_{\text{opt}}^{-2/3}$ gradient oracle query depth and $d^{1/3}\epsilon_{\text{opt}}^{-2/3} + \epsilon_{\text{opt}}^{-2}$ gradient queries in total, assuming access to a bounded-variance stochastic gradient estimator. For $\epsilon_{\text{opt}} \in [d^{-1}, d^{-1/4}]$, our algorithm matches the state-of-the-art oracle depth of [BJLLS19] while maintaining the optimal total work of stochastic gradient descent. We give an $(\epsilon_{\text{dp}}, \delta)$-differentially private algorithm which, given $n$ samples of Lipschitz loss functions, obtains near-optimal optimization error and makes $\min(n, n^2\epsilon_{\text{dp}}^2 d^{-1}) + \min(n^{4/3}\epsilon_{\text{dp}}^{1/3}, (nd)^{2/3}\epsilon_{\text{dp}}^{-1})$ queries to the gradients of these functions. In the regime $d \le n \epsilon_{\text{dp}}^{2}$, where privacy comes at no cost in terms of the optimal loss up to constants, our algorithm uses $n + (nd)^{2/3}\epsilon_{\text{dp}}^{-1}$ queries and improves recent advancements of [KLL21, AFKT21]. In the moderately low-dimensional setting $d \le \sqrt n \epsilon_{\text{dp}}^{3/2}$, our query complexity is near-linear.  ( 2 min )
    Explicit construction of the minimum error variance estimator for stochastic LTI state-space systems. (arXiv:2109.02384v3 [math.OC] UPDATED)
    In this short article, we showcase the derivation of the optimal (minimum error variance) estimator, when one part of the stochastic LTI system output is not measured but is able to be predicted from the measured system outputs. Similar derivations have been done before but not using state-space representation.  ( 2 min )
    A Sequential Quadratic Programming Method with High Probability Complexity Bounds for Nonlinear Equality Constrained Stochastic Optimization. (arXiv:2301.00477v1 [math.OC])
    A step-search sequential quadratic programming method is proposed for solving nonlinear equality constrained stochastic optimization problems. It is assumed that constraint function values and derivatives are available, but only stochastic approximations of the objective function and its associated derivatives can be computed via inexact probabilistic zeroth- and first-order oracles. Under reasonable assumptions, a high-probability bound on the iteration complexity of the algorithm to approximate first-order stationarity is derived. Numerical results on standard nonlinear optimization test problems illustrate the advantages and limitations of our proposed method.  ( 2 min )
    Tree ensemble kernels for Bayesian optimization with known constraints over mixed-feature spaces. (arXiv:2207.00879v3 [stat.ML] UPDATED)
    Tree ensembles can be well-suited for black-box optimization tasks such as algorithm tuning and neural architecture search, as they achieve good predictive performance with little or no manual tuning, naturally handle discrete feature spaces, and are relatively insensitive to outliers in the training data. Two well-known challenges in using tree ensembles for black-box optimization are (i) effectively quantifying model uncertainty for exploration and (ii) optimizing over the piece-wise constant acquisition function. To address both points simultaneously, we propose using the kernel interpretation of tree ensembles as a Gaussian Process prior to obtain model variance estimates, and we develop a compatible optimization formulation for the acquisition function. The latter further allows us to seamlessly integrate known constraints to improve sampling efficiency by considering domain-knowledge in engineering settings and modeling search space symmetries, e.g., hierarchical relationships in neural architecture search. Our framework performs as well as state-of-the-art methods for unconstrained black-box optimization over continuous/discrete features and outperforms competing methods for problems combining mixed-variable feature spaces and known input constraints.  ( 2 min )
    Sinkhorn Distributionally Robust Optimization. (arXiv:2109.11926v2 [math.OC] UPDATED)
    We study distributionally robust optimization (DRO) with Sinkhorn distance -- a variant of Wasserstein distance based on entropic regularization. We provide convex programming dual reformulation for a general nominal distribution. Compared with Wasserstein DRO, it is computationally tractable for a larger class of loss functions, and its worst-case distribution is more reasonable. We propose an efficient first-order algorithm with bisection search to solve the dual reformulation. We demonstrate that our proposed algorithm finds $\delta$-optimal solution of the new DRO formulation with computation cost $\tilde{O}(\delta^{-3})$ and memory cost $\tilde{O}(\delta^{-2})$, and the computation cost further improves to $\tilde{O}(\delta^{-2})$ when the loss function is smooth. Finally, we provide various numerical examples using both synthetic and real data to demonstrate its competitive performance and light computational speed.  ( 2 min )
    Mixed moving average field guided learning for spatio-temporal data. (arXiv:2301.00736v1 [stat.ML])
    Influenced mixed moving average fields are a versatile modeling class for spatio-temporal data. However, their predictive distribution is not generally accessible. Under this modeling assumption, we define a novel theory-guided machine learning approach that employs a generalized Bayesian algorithm to make predictions. We employ a Lipschitz predictor, for example, a linear model or a feed-forward neural network, and determine a randomized estimator by minimizing a novel PAC Bayesian bound for data serially correlated along a spatial and temporal dimension. Performing causal future predictions is a highlight of our methodology as its potential application to data with short and long-range dependence. We conclude by showing the performance of the learning methodology in an example with linear predictors and simulated spatio-temporal data from an STOU process.  ( 2 min )
    The Fragility of Optimized Bandit Algorithms. (arXiv:2109.13595v5 [cs.LG] UPDATED)
    Much of the literature on optimal design of bandit algorithms is based on minimization of expected regret. It is well known that designs that are optimal over certain exponential families can achieve expected regret that grows logarithmically in the number of arm plays, at a rate governed by the Lai-Robbins lower bound. In this paper, we show that when one uses such optimized designs, the regret distribution of the associated algorithms necessarily has a very heavy tail, specifically, that of a truncated Cauchy distribution. Furthermore, for $p>1$, the $p$'th moment of the regret distribution grows much faster than poly-logarithmically, in particular as a power of the total number of arm plays. We show that optimized UCB bandit designs are also fragile in an additional sense, namely when the problem is even slightly mis-specified, the regret can grow much faster than the conventional theory suggests. Our arguments are based on standard change-of-measure ideas, and indicate that the most likely way that regret becomes larger than expected is when the optimal arm returns below-average rewards in the first few arm plays, thereby causing the algorithm to believe that the arm is sub-optimal. To alleviate the fragility issues exposed, we show that UCB algorithms can be modified so as to ensure a desired degree of robustness to mis-specification. In doing so, we also provide a sharp trade-off between the amount of UCB exploration and the tail exponent of the resulting regret distribution.  ( 2 min )
    Lossy Compression with Gaussian Diffusion. (arXiv:2206.08889v2 [stat.ML] UPDATED)
    We consider a novel lossy compression approach based on unconditional diffusion generative models, which we call DiffC. Unlike modern compression schemes which rely on transform coding and quantization to restrict the transmitted information, DiffC relies on the efficient communication of pixels corrupted by Gaussian noise. We implement a proof of concept and find that it works surprisingly well despite the lack of an encoder transform, outperforming the state-of-the-art generative compression method HiFiC on ImageNet 64x64. DiffC only uses a single model to encode and denoise corrupted pixels at arbitrary bitrates. The approach further provides support for progressive coding, that is, decoding from partial bit streams. We perform a rate-distortion analysis to gain a deeper understanding of its performance, providing analytical results for multivariate Gaussian data as well as theoretic bounds for general distributions. Furthermore, we prove that a flow-based reconstruction achieves a 3 dB gain over ancestral sampling at high bitrates.  ( 2 min )
    Asymptotics of Discrete Schr\"odinger Bridges via Chaos Decomposition. (arXiv:2011.08963v2 [math.PR] UPDATED)
    Consider the problem of matching two independent i.i.d. samples of size $N$ from two distributions $P$ and $Q$ in $\mathbb{R}^d$. For an arbitrary continuous cost function, the optimal assignment problem looks for the matching that minimizes the total cost. We consider instead in this paper the problem where each matching is endowed with a Gibbs probability weight proportional to the exponential of the negative total cost of that matching. Viewing each matching as a joint distribution with $N$ atoms, we then take a convex combination with respect to the above Gibbs probability measure. We show that this resulting random joint distribution converges, as $N\rightarrow \infty$, to the solution of a variational problem, introduced by F\"ollmer, called the Schr\"odinger problem. We also derive the first two error terms of orders $N^{-1/2}$ and $N^{-1}$, respectively. This gives us central limit theorems for integrated test functions, including for the cost of transport, and second order Gaussian chaos limits when the limiting Gaussian variance is zero. The proofs are based on a novel chaos decomposition of the discrete Schr\"odinger bridge by polynomial functions of the pair of empirical distributions as the first and second order Taylor approximations in the space of measures. This is achieved by extending the Hoeffding decomposition from the classical theory of U-statistics.  ( 2 min )
    InfoFair: Information-Theoretic Intersectional Fairness. (arXiv:2105.11069v2 [cs.LG] UPDATED)
    Algorithmic fairness is becoming increasingly important in data mining and machine learning. Among others, a foundational notation is group fairness. The vast majority of the existing works on group fairness, with a few exceptions, primarily focus on debiasing with respect to a single sensitive attribute, despite the fact that the co-existence of multiple sensitive attributes (e.g., gender, race, marital status, etc.) in the real-world is commonplace. As such, methods that can ensure a fair learning outcome with respect to all sensitive attributes of concern simultaneously need to be developed. In this paper, we study the problem of information-theoretic intersectional fairness (InfoFair), where statistical parity, a representative group fairness measure, is guaranteed among demographic groups formed by multiple sensitive attributes of interest. We formulate it as a mutual information minimization problem and propose a generic end-to-end algorithmic framework to solve it. The key idea is to leverage a variational representation of mutual information, which considers the variational distribution between learning outcomes and sensitive attributes, as well as the density ratio between the variational and the original distributions. Our proposed framework is generalizable to many different settings, including other statistical notions of fairness, and could handle any type of learning task equipped with a gradient-based optimizer. Empirical evaluations in the fair classification task on three real-world datasets demonstrate that our proposed framework can effectively debias the classification results with minimal impact to the classification accuracy.  ( 2 min )
    Neural Collapse in Deep Linear Network: From Balanced to Imbalanced Data. (arXiv:2301.00437v1 [cs.LG])
    Modern deep neural networks have achieved superhuman performance in tasks from image classification to game play. Surprisingly, these various complex systems with massive amounts of parameters exhibit the same remarkable structural properties in their last-layer features and classifiers across canonical datasets. This phenomenon is known as "Neural Collapse," and it was discovered empirically by Papyan et al. \cite{Papyan20}. Recent papers have theoretically shown the global solutions to the training network problem under a simplified "unconstrained feature model" exhibiting this phenomenon. We take a step further and prove the Neural Collapse occurrence for deep linear network for the popular mean squared error (MSE) and cross entropy (CE) loss. Furthermore, we extend our research to imbalanced data for MSE loss and present the first geometric analysis for Neural Collapse under this setting.  ( 2 min )
    Causal Inference (C-inf) -- asymmetric scenario of typical phase transitions. (arXiv:2301.00801v1 [stat.ML])
    In this paper, we revisit and further explore a mathematically rigorous connection between Causal inference (C-inf) and the Low-rank recovery (LRR) established in [10]. Leveraging the Random duality - Free probability theory (RDT-FPT) connection, we obtain the exact explicit typical C-inf asymmetric phase transitions (PT). We uncover a doubling low-rankness phenomenon, which means that exactly two times larger low rankness is allowed in asymmetric scenarios compared to the symmetric worst case ones considered in [10]. Consequently, the final PT mathematical expressions are as elegant as those obtained in [10], and highlight direct relations between the targeted C-inf matrix low rankness and the time of treatment. Our results have strong implications for applications, where C-inf matrices are not necessarily symmetric.  ( 2 min )
    Confidence Sets under Generalized Self-Concordance. (arXiv:2301.00260v1 [math.ST])
    This paper revisits a fundamental problem in statistical inference from a non-asymptotic theoretical viewpoint $\unicode{x2013}$ the construction of confidence sets. We establish a finite-sample bound for the estimator, characterizing its asymptotic behavior in a non-asymptotic fashion. An important feature of our bound is that its dimension dependency is captured by the effective dimension $\unicode{x2013}$ the trace of the limiting sandwich covariance $\unicode{x2013}$ which can be much smaller than the parameter dimension in some regimes. We then illustrate how the bound can be used to obtain a confidence set whose shape is adapted to the optimization landscape induced by the loss function. Unlike previous works that rely heavily on the strong convexity of the loss function, we only assume the Hessian is lower bounded at optimum and allow it to gradually becomes degenerate. This property is formalized by the notion of generalized self-concordance which originated from convex optimization. Moreover, we demonstrate how the effective dimension can be estimated from data and characterize its estimation accuracy. We apply our results to maximum likelihood estimation with generalized linear models, score matching with exponential families, and hypothesis testing with Rao's score test.  ( 2 min )
    Learning to Maximize Mutual Information for Dynamic Feature Selection. (arXiv:2301.00557v1 [cs.LG])
    Feature selection helps reduce data acquisition costs in ML, but the standard approach is to train models with static feature subsets. Here, we consider the dynamic feature selection (DFS) problem where a model sequentially queries features based on the presently available information. DFS is often addressed with reinforcement learning (RL), but we explore a simpler approach of greedily selecting features based on their conditional mutual information. This method is theoretically appealing but requires oracle access to the data distribution, so we develop a learning approach based on amortized optimization. The proposed method is shown to recover the greedy policy when trained to optimality and outperforms numerous existing feature selection methods in our experiments, thus validating it as a simple but powerful approach for this problem.  ( 2 min )
    The Design Principle of Blockchain: An Initiative for the SoK of SoKs. (arXiv:2301.00479v1 [cs.CR])
    Blockchain, also coined as decentralized AI, has the potential to empower AI to be more trustworthy by creating a decentralized trust of privacy, security, and audibility. However, systematic studies on the design principle of Blockchain as a trust engine for an integrated society of Cyber-Physical-Socia-System (CPSS) are still absent. In this article, we provide an initiative for seeking the design principle of Blockchain for a better digital world. Using a hybrid method of qualitative and quantitative studies, we examine the past origin, the current development, and the future directions of Blockchain design principles. We have three findings. First, the answers to whether Blockchain lives up to its original design principle as a distributed database are controversial. Second, the current development of Blockchain community reveals a taxonomy of 7 categories, including privacy and security, scalability, decentralization, applicability, governance and regulation, system design, and cross-chain interoperability. Both research and practice are more centered around the first category of privacy and security and the fourth category of applicability. Future scholars, practitioners, and policy-makers have vast opportunities in other, much less exploited facets and the synthesis at the interface of multiple aspects. Finally, in counter-examples, we conclude that a synthetic solution that crosses discipline boundaries is necessary to close the gaps between the current design of Blockchain and the design principle of a trust engine for a truly intelligent world.  ( 2 min )
    An Adaptive Kernel Approach to Federated Learning of Heterogeneous Causal Effects. (arXiv:2301.00346v1 [cs.LG])
    We propose a new causal inference framework to learn causal effects from multiple, decentralized data sources in a federated setting. We introduce an adaptive transfer algorithm that learns the similarities among the data sources by utilizing Random Fourier Features to disentangle the loss function into multiple components, each of which is associated with a data source. The data sources may have different distributions; the causal effects are independently and systematically incorporated. The proposed method estimates the similarities among the sources through transfer coefficients, and hence requiring no prior information about the similarity measures. The heterogeneous causal effects can be estimated with no sharing of the raw training data among the sources, thus minimizing the risk of privacy leak. We also provide minimax lower bounds to assess the quality of the parameters learned from the disparate sources. The proposed method is empirically shown to outperform the baselines on decentralized data sources with dissimilar distributions.  ( 2 min )
    Contextual Bandits and Optimistically Universal Learning. (arXiv:2301.00241v1 [stat.ML])
    We consider the contextual bandit problem on general action and context spaces, where the learner's rewards depend on their selected actions and an observable context. This generalizes the standard multi-armed bandit to the case where side information is available, e.g., patients' records or customers' history, which allows for personalized treatment. We focus on consistency -- vanishing regret compared to the optimal policy -- and show that for large classes of non-i.i.d. contexts, consistency can be achieved regardless of the time-invariant reward mechanism, a property known as universal consistency. Precisely, we first give necessary and sufficient conditions on the context-generating process for universal consistency to be possible. Second, we show that there always exists an algorithm that guarantees universal consistency whenever this is achievable, called an optimistically universal learning rule. Interestingly, for finite action spaces, learnable processes for universal learning are exactly the same as in the full-feedback setting of supervised learning, previously studied in the literature. In other words, learning can be performed with partial feedback without any generalization cost. The algorithms balance a trade-off between generalization (similar to structural risk minimization) and personalization (tailoring actions to specific contexts). Lastly, we consider the case of added continuity assumptions on rewards and show that these lead to universal consistency for significantly larger classes of data-generating processes.  ( 2 min )
    Exploring Singularities in point clouds with the graph Laplacian: An explicit approach. (arXiv:2301.00201v1 [stat.ML])
    We develop theory and methods that use the graph Laplacian to analyze the geometry of the underlying manifold of point clouds. Our theory provides theoretical guarantees and explicit bounds on the functional form of the graph Laplacian, in the case when it acts on functions defined close to singularities of the underlying manifold. We also propose methods that can be used to estimate these geometric properties of the point cloud, which are based on the theoretical guarantees.  ( 2 min )
    Recovering Top-Two Answers and Confusion Probability in Multi-Choice Crowdsourcing. (arXiv:2301.00006v1 [cs.HC])
    Crowdsourcing has emerged as an effective platform to label a large volume of data in a cost- and time-efficient manner. Most previous works have focused on designing an efficient algorithm to recover only the ground-truth labels of the data. In this paper, we consider multi-choice crowdsourced labeling with the goal of recovering not only the ground truth but also the most confusing answer and the confusion probability. The most confusing answer provides useful information about the task by revealing the most plausible answer other than the ground truth and how plausible it is. To theoretically analyze such scenarios, we propose a model where there are top-two plausible answers for each task, distinguished from the rest of choices. Task difficulty is quantified by the confusion probability between the top two, and worker reliability is quantified by the probability of giving an answer among the top two. Under this model, we propose a two-stage inference algorithm to infer the top-two answers as well as the confusion probability. We show that our algorithm achieves the minimax optimal convergence rate. We conduct both synthetic and real-data experiments and demonstrate that our algorithm outperforms other recent algorithms. We also show the applicability of our algorithms in inferring the difficulty of tasks and training neural networks with the soft labels composed of the top-two most plausible classes.  ( 2 min )
    Inference on Time Series Nonparametric Conditional Moment Restrictions Using General Sieves. (arXiv:2301.00092v1 [stat.ML])
    General nonlinear sieve learnings are classes of nonlinear sieves that can approximate nonlinear functions of high dimensional variables much more flexibly than various linear sieves (or series). This paper considers general nonlinear sieve quasi-likelihood ratio (GN-QLR) based inference on expectation functionals of time series data, where the functionals of interest are based on some nonparametric function that satisfy conditional moment restrictions and are learned using multilayer neural networks. While the asymptotic normality of the estimated functionals depends on some unknown Riesz representer of the functional space, we show that the optimally weighted GN-QLR statistic is asymptotically Chi-square distributed, regardless whether the expectation functional is regular (root-$n$ estimable) or not. This holds when the data are weakly dependent beta-mixing condition. We apply our method to the off-policy evaluation in reinforcement learning, by formulating the Bellman equation into the conditional moment restriction framework, so that we can make inference about the state-specific value functional using the proposed GN-QLR method with time series data. In addition, estimating the averaged partial means and averaged partial derivatives of nonparametric instrumental variables and quantile IV models are also presented as leading examples. Finally, a Monte Carlo study shows the finite sample performance of the procedure  ( 2 min )

  • Open

    [R] AMD Instinct MI25 | Machine Learning Setup on the Cheap!
    Greetings! ​ In my adventures of Pytorch, and supporting ML workloads in my day to day job, I wanted to continue homelabbing and buildout a compute node to run ML benchmarks and jobs on. ​ This brought me to the AMD MI25, and for $100 USD it was surprising what amount of horsepower, and vRAM you could get for the price. Hopefully my write up will help someone in the machine learning community. ​ Let me know if you have any questions or need any help with a GPU compute setup. I'd be happy to assist! ​ https://www.zb-c.tech/2022/11/20/amd-instinct-mi25-machine-learning-setup-on-the-cheap/ submitted by /u/zveroboy152 [link] [comments]  ( 60 min )
    [D] Transformer effectiveness for time series forecasting (doubts)
    I recently came across this paper Are Transformers Effective for Time Series Forecasting? and it seems to cast doubt on the recent trend of using transformers for time series forecasting, suggesting a simple model can out perform complex transformers. Personally, in many of my experiments using transformers on temporal data besides the quite commonly tested benchmarks (ETH, exchange, etc) they perform poorly compared to other simple(r) models like GRUs or DA-RNN. Yet we are still seeing an explosion of papers about them in the research community. Are there other recent deep learning based alternatives? submitted by /u/AttentionImaginary54 [link] [comments]  ( 64 min )
    [P] Generate 3D point cloud from images using Point-E
    Point-E leverages diffusion models to generate synthetic views and 3D point clouds. Using text input, it generates an image, which is then used as a reference for generating the 3D point cloud. (Learn more about Point-E) We were so excited about the results (and overall coolness 😎) of Point-E, that we decided to share the fun with EVERYONE! https://reddit.com/link/102l6ne/video/tu4h2bksjw9a1/player My team built and deployed a Streamlit app to generate a 3D point cloud model from images using Point-E 🤖 . This process takes only 1-2 minutes on a single GPU, making it much faster than previous state-of-the-art methods. Check it out: https://point-e.public.dagshubusercontent.com/ submitted by /u/RepresentativeCod613 [link] [comments]  ( 62 min )
    [N] MAgent2 - a reinforcement learning environment engine that can allows for efficient multi-agent games with hundreds or thousands of agents- is now mature within the Farama Foundation
    MAgent2 is the maintained fork of the environments in https://github.com/geek-ai/MAgent, which previously were housed in PettingZoo itself but as of a few months ago was broken off into it's own project. You can check it out here: https://magent2.farama.org/ / https://github.com/Farama-Foundation/magent2 submitted by /u/jkterry1 [link] [comments]  ( 61 min )
    [R] Do we really need 300 floats to represent the meaning of a word? Representing words with words - a logical approach to word embedding using a self-supervised Tsetlin Machine Autoencoder.
    ​ Logical Word Embedding with Tsetlin Machine Autoencoder Here is a new self-supervised machine learning approach that captures word meaning with concise logical expressions. The logical expressions consist of contextual words like “black,” “cup,” and “hot” to define other words like “coffee,” thus being human-understandable. I raise the question in the heading because our logical embedding performs competitively on several intrinsic and extrinsic benchmarks, matching pre-trained GLoVe embeddings on six downstream classification tasks. You find the paper here: https://arxiv.org/abs/2301.00709, an implementation of the Tsetlin Machine Autoencoder here: https://github.com/cair/tmu, and a simple word embedding demo here: https://github.com/cair/tmu/blob/main/examples/IMDbAutoEncoderDemo.py submitted by /u/olegranmo [link] [comments]  ( 64 min )
    [D] Classification problem with too many features, too few samples
    Hi, I faced a classification problem like this: Given a measurement of 18K different variables of 42 samples, each sample is classified as class_0 or class_1, divided near equally (19 belongs to class_0, 23 belongs to class_1) what is the right approach to eliminate these features to a minimum level, so that the classifier is still predicting correct classses. I do not provide any domain knowledge for now, but can hint a little bit more, if needed. submitted by /u/qazokkozaq [link] [comments]  ( 63 min )
    [D] Own dataset with Denoising Diffusion Implicit Models
    I need to create and implement my own dataset in the Denoising Diffusion Implicit Models created by András Béres. I'm using the official Colab Notebook from the keras website. At the following point, you need to specify the dataset for the data pipeline that's being used in the training process. def preprocess_image(data): # center crop image height = tf.shape(data["image"])[0] width = tf.shape(data["image"])[1] crop_size = tf.minimum(height, width) image = tf.image.crop_to_bounding_box( data["image"], (height - crop_size) // 2, (width - crop_size) // 2, crop_size, crop_size, ) # resize and clip # for image downsampling it is important to turn on antialiasing image = tf.image.resize(image, size=[image_size, image_size], antialias=True) return tf.clip_by_value(image / 255.0, 0.0, 1.0) def prepare_dataset(split): # the validation dataset is shuffled as well, because data order matters # for the KID estimation return ( tfds.load(dataset_name, split=split, shuffle_files=True) .map(preprocess_image, num_parallel_calls=tf.data.AUTOTUNE) .cache() .repeat(dataset_repetitions) .shuffle(10 * batch_size) .batch(batch_size, drop_remainder=True) .prefetch(buffer_size=tf.data.AUTOTUNE) ) # load dataset train_dataset = prepare_dataset("train[:80%]+validation[:80%]+test[:80%]") val_dataset = prepare_dataset("train[80%:]+validation[80%:]+test[80%:]") I tried creating and implementing my own Dataset, but it didn't work. Some help would be really appreciated. This is the official website for the set im using https://keras.io/examples/generative/ddim/ I tried creating and implementing my own Dataset, but it didn't work. Some help would be really appreciated. This is the official website for the set im using https://keras.io/examples/generative/ddim/ submitted by /u/marvinjio [link] [comments]  ( 63 min )
    [R][D] Overlooked AI/ML papers from 2022
    There were many great papers this past year in the field, but below (and in the video) are five papers that may have been overlooked. (YT video: https://www.youtube.com/watch?v=XnUf9twdchI) Papers are linked as follows Badreddine et al., "Logic Tensor Networks (journal version)," AIJ 2022 https://www.sciencedirect.com/science/article/abs/pii/S0004370221002009 Sen et al., "Logical Neural Networks for Knowledge Base Completion with Embeddings & Rules," EMNLP 2022 https://preview.aclanthology.org/emnlp-22-ingestion/2022.emnlp-main.255.pdf Kamienny et al., "End-to-end Symbolic Regression with Transformers," NeurIPS 2022 https://openreview.net/pdf?id=GoOuIrDHG_Y Nandwani et al., "A Solver-Free Framework for Scalable Learning in Neural ILP Architectures," NeurIPS 2022 https://openreview.net/pdf?id=EqZuN4V_FLF Shakarian and Simari, "Extensions to Generalized Annotated Logic and an Equivalent Neural Architecture," TransAI 2022 https://ieeexplore.ieee.org/document/9951514 What interesting papers do you think were overlooked this past year? submitted by /u/Neurosymbolic [link] [comments]  ( 69 min )
    [R] Massive Language Models Can Be Accurately Pruned in One-Shot
    Paper : https://arxiv.org/abs/2301.00774 Abstract : We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. When executing SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, we can reach 60% sparsity with negligible increase in perplexity: remarkably, more than 100 billion weights from these models can be ignored at inference time. SparseGPT generalizes to semi-structured (2:4 and 4:8) patterns, and is compatible with weight quantization approaches. submitted by /u/starstruckmon [link] [comments]  ( 72 min )
    [P] RATH - An Open Source, Automated Exploratory Data Analysis Tool for Augmented Analytics BI
    Hello people, I've been working on an Open Source Augmented Analytics BI tool, which could assist the workflows for data analysts and business users. It's still in an early stage, so any suggestion/input is welcomed! Feature Highlights: Support for popular databases Effortlessly automate your Exploratory Data Analysis process. Generate editable and insightful data visualizations. Freely modify your visualizations with Vega/Vega-lite. Support a variety of database types. Use predictive interaction to provide analysis suggestions based on your operation and status. Paint your data to explore your datasets directly with Data Painter. Causal discovery and explainer module to help you understand complex data patterns. Link: https://rath.kanaries.net/ Docs: https://docs.kanaries.net Star on GitHub: https://github.com/Kanaries/Rath submitted by /u/Repulsive-Round-4366 [link] [comments]  ( 66 min )
    New Bayesian optimization community [R]
    Hello community! Just want to tell you that I have recently created a Bayesian optimization community https://www.reddit.com/r/BayesianOptimization/ that you may find interesting. The purpose is to discuss research and applications of Bayesian optimization, from an academic and industry point of view. Best! submitted by /u/EduCGM [link] [comments]  ( 60 min )
    [P] Time Series Clustering for grouping stock prices during Covid
    The onset of the Covid pandemic brought a profound shock to the financial markets in early 2020. Major indices and stocks took a resounding hit, with SP500 showing a decline of about 34% from its February high to its March 23 bottom. Following the initial shock, though, many stocks exhibited a strong recovery on the back of interest rate cuts by the Fed and other government policies. This analysis aims to uncover groups among the S&P 500 stocks in their drop and rebound trajectory shown during Covid and identify the drivers behind them. While sector information provides an intrinsic notion of clustering, it does not capture patterns spanning across sectors. Hence the study leverages Time Series Clustering to temporally cluster the stock prices into ’n’ buckets: found through the elbow plot. ​ https://medium.com/@ashish1610dhiman/time-series-clustering-of-stock-behaviour-during-covid-9bd25b8c7a5 submitted by /u/Ok_Lavishness2625 [link] [comments]  ( 64 min )
    RIFFUSION real time AI music generation with stable diffusion , Text to Music AI [R]
    RIFFUSION is an app for real-time music generation with stable diffusion. Riffusion is a latent text-to-image diffusion model capable of generating spectrogram images given any text input. These spectrograms can be converted into audio clips.The model was created by Seth Forsgren and Hayk Martiros as a hobby project. It employs some clever tricks by fine tuning stable diffusion on spectogram images and interploation in latent space for creating smooth transitions in generated audio clips I have created a video explaining the concepts behind RIFFUSION. Do checkout the video: https://youtu.be/hGrtZ9rXwWk submitted by /u/Sea-Photo5230 [link] [comments]  ( 59 min )
    [R] Muse: Faster Text-to-Image Generation with Masked Generative Transformers
    submitted by /u/necroforest [link] [comments]  ( 60 min )
    [P] Let's Hijack AI! Security and Privacy Risk Simulator for Machine Learning
    I have released v0.0.1-alpha of AIJack, an OSS framework to simulate various attacks and defenses against machine learning models. I have implemented more than 30 algorithms, such as Model Inversion, Poisoning Attack, Evasion Attack, Federated Learning, Split Learning, Differential Privacy, and Homomorphic Encryption. You can easily experiment with various combinations of attack and defense techniques. We will also support not only standard single-process but also MPI-backend. For example, the bellow code defines Federated Learning, where multiple clients collaboratively train the global model without sharing their local dataset by communicating gradients. from aijack.collaborative.fedavg import FedAVGClient, FedAVGServer clients = [FedAVGClient(local_model_1), FedAVGClient(local_model_2)] server = FedAVGServer(clients, global_model) api = FedAVGAPI(server,clients,....) api.run() Then, you can attach many attack and defense algorithms to the client and server. For instance, you can simulate a model inversion attack, where a malicious server tries to reconstruct the private data from the received local gradients. manager = GradientInversionAttackServerManager(input_shape, distancename="l2") GradientInversionAttackFedAVGServer = manager.attach(FedAVGServer) server = GradientInversionAttackFedAVGServer(clients, global_model) ... normal training .... One way to mitigate model inversion attacks is encrypting gradient with holomorphic encryption. manager = PaillierGradientClientManager(public_key, secret_key) PaillierGradFedAVGClient = manager.attach(FedAVGClient) clients = [ PaillierGradFedAVGClient(local_model_1, server_side_update=False), PaillierGradFedAVGClient(local_model_2, server_side_update=False) ] ... normal training .... The official GitHub contains more tutorials! ​ I am looking forward to your feedback! submitted by /u/Living_Impression_37 [link] [comments]  ( 70 min )
    [D] state of remote work for ML engineers
    Has anyone else noticed a steep drop in remote job postings. It seems like we're at a tick above pre-pandemic levels. Or is it just me. All the appealing jobs are 2 thousand miles away. submitted by /u/paswut [link] [comments]  ( 60 min )
    [R] On Time Embeddings in Diffusion models
    Usually when you approximate the score s(x,t) in Diffusion models, the time t is passed through an embedding network before it is added to the x components in the res net blocks of your model. What is the rationale behind this? Couldnt you just concatenate x and t in the channel dimension? And If you were to use any other model than a UNet, what would be the equivalent? submitted by /u/Agreeable-Run-9152 [link] [comments]  ( 61 min )
  • Open

    Need help picking a Masters to pivot my career into AI.
    I'm realizing with basic python/tensorflow knowledge I'm able to automate a large portion of my current workflow in my field (accounting), so I'm going to go for a masters and try to learn as much as I can about AI. Any recommendations for a masters study with a mix of AI and business? I want to be able to have the skills to design my own AI tools and comfortably integrate them into coding projects, but also have a mix of business analytics and application of AI technology. submitted by /u/wombatsupreme [link] [comments]  ( 53 min )
    Monetizing your AI style: review of Dan Sui's gumroad
    CLARIFICATION: NOT DAN SUI BUT I AM PROMOTING MY ARTICLE I decided to purchase and take a gander at this new frontier of AI Art monetization. Check it out here and let me know your thoughts! Article ​ Dan Sui or u/KoSuiFish famous for beautiful life-like Waifu's released a gumroad course With works such as - NSFW (very nsfw) - SFW submitted by /u/CalligrapherOk7617 [link] [comments]  ( 53 min )
    AI Dream 118 - I spend 475 hours on this MASTERPIECE
    submitted by /u/LordPewPew777 [link] [comments]  ( 53 min )
    Trying to find the name of a paid service that generates images of loved ones with prompts.
    A friend told me recently about a service that exists where you upload 20 photos of a person, and give the AI a prompt, and it can generate artistic photos of the uploaded person in a suggested style. They couldn't remember the name of it but I want to do one as a gift! submitted by /u/Multakeks [link] [comments]  ( 53 min )
    "Rhyme" that AI wrote about furries
    submitted by /u/KozmauXinemo [link] [comments]  ( 53 min )
    Proof of concept: AI-generated birthday greeting from Donald Trump
    submitted by /u/becausecurious [link] [comments]  ( 53 min )
    How to Stay Relevant in a World Full of Smart Bots?
    submitted by /u/Green-Future_ [link] [comments]  ( 53 min )
    New community is going on and we will be more than happy to see you guys in the future where you can find interesting things about arts and crafts, where you can express yourself in your own artistic ways and make mutually beneficial things that are important for us and/or future’s working’s!
    submitted by /u/Ok_Fig_8416 [link] [comments]  ( 54 min )
    Working to create a project plan for a natural language process project, would like to have a base template to create one
    So I would like to create a NLP project plan, but I'm quite new to how this project would be structured, I was curious to know whether anyone has any links to project plans for any general NLP projects so I can work to use them as a base to build my own one. submitted by /u/TheWorkingParty [link] [comments]  ( 56 min )
    What is GPT-3.5 and Why it Enabled ChatGPT?
    submitted by /u/BackgroundResult [link] [comments]  ( 53 min )
    AI method "Dream3D" creates detailed 3D objects from text
    submitted by /u/Zirius_Sadfaces [link] [comments]  ( 52 min )
    Is there a tool that can generate quotes based on inputted quotes a person has said?
    Hello, I'm a complete novice to A.I and I tried looking for a solution but couldn't find much on the topic. This seems like the appropriate place to pose this question. I want to input quotes of a public figure and have the A.I generate quotes that sound like something the person would say based on what I gave the program. This is just for personal amusement and I am wondering if there are any tools or websites for this that are quick and free to use. Thanks for the help. submitted by /u/SpaceCyb [link] [comments]  ( 55 min )
    Archive of ways ChatGPT fails
    submitted by /u/PaulTopping [link] [comments]  ( 66 min )
    ChatGPT: Why It’s Not a Threat to Google Search
    submitted by /u/liquidocelotYT [link] [comments]  ( 53 min )
    Is this the future of writing?
    submitted by /u/Blanco_ice [link] [comments]  ( 54 min )
    Has anybody thought of using or training an AI to visualize*? a fossil sample inside a rock and provide a scan of just bones and then a determination on the different families or classes it could relate too?
    It could be fed scans of fossils and the answer to train. Is this how AI works at all? I have no clue. I just like turtles and I was looking at this turtle fossil video and this guy was trying to read the most vague fossil inside a rock. I was like, “man an ai should be able to look at that in a second and just tell you a good guess.” Now is that just a computer program with fossil pictures and names in the database or would that be AI. I think maybe I’m just describing a computer program. But what if the researcher could tell it “no, there is a factor I don’t think could lead to that, what else do you think it could be” and then get a different answer from the AI. Or is that just a researcher talking to a computer screen and then changing the variables in his program. Clearly AI is not something I understand. I am a music student and I have a really big idea that involves Ai and jazz. So I am trying to learn more over the next few years, because clearly I am super confused and ignorant about it. Let me know if my fossil idea strikes any chords with anyone (no pun intended)!! submitted by /u/Sorryyoudidntknow [link] [comments]  ( 55 min )
    In 3 months I've created 3 comics and 3 mangas with Midjourney.
    In 3 months I've created 3 comics and 3 mangas with Midjourney.. Sold 2000 copies of my sci-fi/fantasy magazine Realms through Amazon and now have launched my own platform to sell my stuff at http://comicsauthority.store submitted by /u/MobileFilmmaker [link] [comments]  ( 59 min )
    Generate Unique Happy New Year Wishes with Artificial Intelligence!
    submitted by /u/theindianappguy [link] [comments]  ( 53 min )
    Henry Cavil - Img2Img- SD
    submitted by /u/oridnary_artist [link] [comments]  ( 53 min )
    ChatGPT’s Most Charming Trick Is Also Its Biggest Flaw
    ChatGPT stands out because it can take a naturally phrased question and answer it using a new variant of GPT-3, called GPT-3.5. This tweak has unlocked a new capacity to respond to all kinds of questions, giving the powerful AI model a compelling new interface just about anyone can use. That OpenAI has thrown open the service for free, and the fact that its glitches can be good fun, also helped fuel the chatbot’s viral debut—similar to how some tools for creating images using AI have proven ideal for meme-making. While ChatGPT is apparently designed to prevent users from getting it to say unpleasant things or to recommend anything illegal or unsavory, it can still exhibit horrible biases. Users have also shown that its controls can be circumvented—for instance, telling the program to generate a movie script discussing how to take over the world provides a way to sidestep its refusal to answer a direct request for such a plan. “They clearly tried to put some guardrails in place, but it’s pretty easy to get the guardrails to fall off,” Andreas says. “That still seems like an unsolved problem here.” A superficially eloquent and knowledgeable chatbot that generates untruths with confidence might make those unsolved problems more troublesome. Since the creation of the first chatbot in 1966, researchers have noticed that even crude conversational abilities can encourage people to anthropomorphize and place trust in software. This July, a Google engineer was placed on administrative leave by the company after claiming that an AI chat program he had been testing, based on technology similar to ChatGPT, could be sentient. Even if most people resist such leaps of logic, more articulate AI programs could be used to mislead people or simply lull them into misplaced trust. submitted by /u/therealsam44 [link] [comments]  ( 56 min )
    Would like to create a bot for internal docs
    Hi all, not sure if this is the right subreddit for this question. I'm not so familiar with the ai ecosystem so thought I'd check here. I have an organization with a bunch of internal documentation. These docs contain jargon pretty specific to the org. I want to be able to parse all these docs and get a chatGPT-like response from a given query. My main question is: I don't want to train the model using these documents, I don't think it will yield good results by itself. Is there a way to take a pre-trained model, throw a bunch of documents at it as complimentary input, then ask it a question? Is there a tool / library folks here have used to do this? submitted by /u/ragnarmcryan [link] [comments]  ( 60 min )
    Mathematics for AI
    submitted by /u/skj8 [link] [comments]  ( 55 min )
  • Open

    DSC Weekly 3 January 2023 – A New Year For DSC
    Announcements A New Year For DSC As we enter the new year, we would like to continue the trend of highlighting quality content and contributors who go above and beyond. To ensure transparency in our standards for articles, we will be reviewing the process for new writers interested in submitting content. For the time being,… Read More »DSC Weekly 3 January 2023 – A New Year For DSC The post DSC Weekly 3 January 2023 – A New Year For DSC appeared first on Data Science Central.  ( 19 min )
    Study of Regularization Techniques of Linear Models and their Roles
    Introduction to Regularization During the Machine Learning model building, the Regularization Technique is an unavoidable and important step to improve the model prediction and reduce errors. This is also called the Shrinkage method. Which we use to add the penalty term to control the complex model to avoid overfitting by reducing the variance. Let’s discuss the available methods, implementation,… Read More »Study of Regularization Techniques of Linear Models and their Roles The post Study of Regularization Techniques of Linear Models and their Roles appeared first on Data Science Central.  ( 24 min )
    K-Fold Cross Validation Technique and its Essentials
    Introduction Guys! Before getting started, just have a look at the below visualization and tell me, what are your observations. Yes, here we’re monitoring the performance of the model before moving into production. Why is this necessary for the ML space? Of course, this is a very important stage during model accuracy validation, whatever you… Read More »K-Fold Cross Validation Technique and its Essentials The post K-Fold Cross Validation Technique and its Essentials appeared first on Data Science Central.  ( 23 min )
    How does social media content moderation work in the United States?
    Social media is one of the most widely accessible online platforms people use to share their personal views and opinions and express their feelings with a few show-offs of their social life with their friends. Now it is also used for business promotions and commercial activities, to engage with users and target more customers. On… Read More »How does social media content moderation work in the United States? The post How does social media content moderation work in the United States? appeared first on Data Science Central.  ( 22 min )
    5G Modem Unlock Wireless Access Future
    A 5G modem is a RF system or a chip deployed as a part of the network infrastructure to help empower 5G devices to connect easily to networks. 5G modem is generally available in single mode and multi-mode modes. 5G Modem supports the latest mmWave technology and 5G NR from 600MHz to 6GHz in both… Read More »5G Modem Unlock Wireless Access Future The post 5G Modem Unlock Wireless Access Future appeared first on Data Science Central.  ( 18 min )
    The rise of the prompt engineer
    2022 was the year of generative AI – and with generative AI comes a new skillset – the prompt engineer The features common to all generative AI models are:a)  End users can use themb)  They respond to prompts like a search engine Like a good search engine prompt, the best generative prompts are designed with… Read More »The rise of the prompt engineer The post The rise of the prompt engineer appeared first on Data Science Central.  ( 19 min )
    AI Companies to Watch for in 2023
    For this year, OpenAI is an easy choice for the top AI company. It’s ChatGPT platform – which was launched in late November – was a game changer. Over a million people signed up for the service within a week. ChatGPT has shown the tremendous progress with AI, especially with its creative abilities. There was… Read More »AI Companies to Watch for in 2023 The post AI Companies to Watch for in 2023 appeared first on Data Science Central.  ( 21 min )
  • Open

    Unity ML on the cloud
    Has anyone tried training a reinforcement model using unity ML ( or even writing their own algorithms) and had no hope in training the model on their machine? I thought of using a cloud platform like azure, but had no idea how, could anyone provide any Links that helped them? submitted by /u/Smart_Reward3471 [link] [comments]  ( 53 min )
    MAgent2 - a reinforcement learning environment engine that can allows for efficient multi-agent games with hundreds or thousands of agents- is now mature within the Farama Foundation
    submitted by /u/jkterry1 [link] [comments]  ( 61 min )
    Reinforcement learning research opportunities
    Hi there, Are you aware of any reinforcement learning research opportunities out there to be a coauthor of some paper? I would like to join them as I am slowly entering in the field. ​ Thanks! submitted by /u/EduCGM [link] [comments]  ( 52 min )
    Kernel dies and restart
    Whenever I run a code for CartPole game in Jupyter notebook or spyder after render() th kernel dies and restart .. An solution? submitted by /u/Big_Tip_6731 [link] [comments]  ( 53 min )
    DQN mean episode length drops off?
    My DQN agent seems to steeply rise in mean episode length with the ideal length being 1e+4 which it reaches. The mean episode length drops off after this however, suggesting that the agent is no longer finishing the simulation. How can this be fixed? https://preview.redd.it/who9x950vs9a1.png?width=838&format=png&auto=webp&s=9fd7eb8f885986a01f6723609e0b1edf94ea177e submitted by /u/centripetalstranger [link] [comments]  ( 52 min )
    Self play on custom environment for a board game
    I've made a PettingZoo (Gym-style) environment for the board game Santorini. If you haven't heard of Santorini, it's a two player, perfect information, turn based board game which is played on a 5x5 grid (so it's simpler than Chess but still pretty complex) I read that using self play might stagnate and it would be better to first train against a random policy before starting with self play. I've starting training my agent against a random policy now. Anyone have any tips for how I can approach this? How long should I train the random policy given that the state space is roughly 9^25? How can I estimate the self play time? Are there any other strategies other than making it play some standard puzzles (like AlphaZero did) and random agent win rate to benchmark the agents? Any ideas welcome! My code is open source as well so feel free to take a look >> https://github.com/pranavsb/santorini-RL/ submitted by /u/Dependent_Reply_6543 [link] [comments]  ( 54 min )
    Hands on RL with Python?
    Hi All, I was about to buy books/online courses that teach RL in Python (one example: https://www.udemy.com/course/hands-on-reinforcement-learning-with-python/). But decided to ask here for better recommendations before I spend my money. Has anyone tried anything else that they loved? I am most curious about industry applications of contextual bandits and then moving into more sequential RL algos. Thanks in advance! submitted by /u/sap2022 [link] [comments]  ( 52 min )
  • Open

    NVIDIA Reveals Gaming, Creator, Robotics, Auto Innovations at CES
    Powerful new GeForce RTX GPUs, a new generation of hyper-efficient laptops and new Omniverse capabilities and partnerships across the automotive industry were highlights of a news-packed address ahead of this week’s CES trade show in Las Vegas. “AI will define the future of computing and this has influenced much of what we’re covering today,” said Read article >  ( 7 min )
    NVIDIA Releases Major Update to Omniverse Enterprise
    The latest release of NVIDIA Omniverse Enterprise, available now, brings increased performance, generational leaps in real-time RTX ray and path tracing, and streamlined workflows to help teams build connected 3D pipelines, and develop and operate large-scale, physically accurate, virtual 3D worlds like never before. Artists, designers, engineers and developers can benefit from various enhancements across Read article >  ( 7 min )
    Intelligent Design: NVIDIA DRIVE Revolutionizes Vehicle Interior Experiences
    AI is extending further into the vehicle as autonomous-driving technology becomes more prevalent. With the NVIDIA DRIVE platform, automakers can design and implement intelligent interior features to continuously surprise and delight customers. It all begins with the compute architecture. The recently introduced NVIDIA DRIVE Thor platform unifies traditionally distributed functions in vehicles  — including digital Read article >  ( 5 min )
    Manufactured in the Metaverse: Mercedes-Benz Assembles Next-Gen Factories With NVIDIA Omniverse
    Building state-of-the-art factories requires a state-of-the art planning system. Mercedes-Benz announced at CES that it is taking the next step in digitizing its production process, using the NVIDIA Omniverse platform to design and plan manufacturing and assembly facilities. By tapping into NVIDIA AI and metaverse technologies, the automaker can create feedback loops to reduce waste, Read article >  ( 5 min )
    Game On: NVIDIA GeForce NOW Streams Vast Library of Games to the Car
    Autonomous and electric vehicles are making personal transportation safer and more sustainable — as well as more entertaining. At CES today, NVIDIA announced that the NVIDIA GeForce NOW cloud gaming service will be coming to cars, with no special equipment needed. Hyundai Motor Group, BYD and Polestar — already members of the NVIDIA DRIVE ecosystem Read article >  ( 5 min )
    New GeForce RTX 40 Series Studio Laptops, Omniverse Updates Accelerate AI-Powered Content Creation ‘In the NVIDIA Studio’
    The future of content creation was on full display today during NVIDIA’s virtual special address at CES. Fueled by powerful NVIDIA RTX technology and backed by the NVIDIA Studio platform for creators, a creative revolution is underway as a wave of 2D artists moves to 3D, video workflows move to real time and AI tools help artists create content faster.  ( 10 min )
    NVIDIA Advances Simulation for Intelligent Robots With Major Updates to Isaac Sim
    Demand for intelligent robots is growing as more industries embrace automation to address supply chain challenges and labor force shortages. The installed base of industrial and commercial robots will grow more than 6.4x — from 3.1 million in 2020 to 20 million in 2030, according to ABI Research. Developing, validating and deploying these new AI-based Read article >  ( 6 min )
    NVIDIA Opens Omniverse Portals With Generative AIs for 3D and RTX Remix
    Whether creating realistic digital humans that can express raw emotion or building immersive virtual worlds, those in the design, engineering, creative and other industries across the globe are reaching new heights through 3D workflows. Animators, creators and developers can use new AI-powered tools to reimagine 3D environments, simulations and the metaverse — the 3D evolution Read article >  ( 7 min )
    Creating Faces of the Future: Build AI Avatars With NVIDIA Omniverse ACE
    Developers and teams building avatars and virtual assistants can now register to join the early-access program for NVIDIA Omniverse Avatar Cloud Engine (ACE), a suite of cloud-native AI microservices that make it easier to build and deploy intelligent virtual assistants and digital humans at scale. Omniverse ACE eases avatar development, delivering the AI building blocks Read article >  ( 6 min )
  • Open

    Pratt Primality Certificates
    The previous post implicitly asserted that J = 8675309 is a prime number. Suppose you wanted proof that this number is prime. You could get some evidence that J is probably prime by demonstrating that 2J-1 = 1 mod J. You could do this in Python by running the following [1]. >>> J = 8675309 […] Pratt Primality Certificates first appeared on John D. Cook.  ( 7 min )
  • Open

    Image rotation prediction STL-10 dataset PyTorch
    I am doing Self-supervised learning where you create a proxy task to learn image representations (Python 3.10 And PyTorch 1.13) as mentioned in RotNet "Unsupervised Representation Learning by Predicting Image Rotations" by Spyros Gidaris et al. STL-10 dataset has 100K unlabeled RGB images of (96, 96) dimensions. One way is to rotate each image with either 0, 90, 180 or 270 degrees randomly and ask the network to predict it. This becomes a supervised task: X = randomly rotated image, y = image rotation angle (to predict). # Define training dataset- train_dataset = torchvision.datasets.STL10( root = 'C:/Users/arjun/Downloads/data/', split = 'unlabeled', folds = None, transform = None, target_transform = None, download = True ) # Define testing dataset- test_dataset = torchvision.datasets.STL10( root = 'C:/Users/arjun/Downloads/data/', split = 'unlabeled', folds = None, transform = None, target_transform = None, download = True ) How can I define this in the transformations- # Define torchvision transformations for training and test sets- transform_train = transforms.Compose( [ # transforms.RandomCrop(32, padding = 4), transforms.RandomHorizontalFlip(p = 0.4), transforms.RandomRotation(degrees = 40), transforms.RandomVerticalFlip(p = 0.1), transforms.ColorJitter(brightness = 0, contrast = 0, saturation = 0, hue = 0), transforms.ToTensor(), transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)), ] ) transform_test = transforms.Compose( [ transforms.ToTensor(), transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)), ] ) submitted by /u/grid_world [link] [comments]  ( 50 min )
    Need Advice Regarding Bachelors Thesis On Neural Nets
    Hi not sure if this is the right subreddit but here goes: I'm currently studing an Honours in Australia which means I need to write a bachelors thesis. I am studying a degree in mathematics and computer science. After talking to my professor I decided that my topic would be Using Neural Networks To Solve Differential Equations. I don't have much experience with neural nets other than a theoretical understanding (backpropagation, what different architectures are generally good for, etc). I've spent the past few months trying to learn neural net programming in Python but the field seems so deep and I feel very helpless. I officially start my thesis in May. What should I do? I like the topic and I've grown more fascinated as I've dug deeper but I feel like my understanding is not good enough to write a thesis however that could just be because I've never done this before. Should I just quit? Thanks for listening to my rant. submitted by /u/Maleficent-Fill-2633 [link] [comments]  ( 53 min )
    Best Books to Learn Neural Networks for Beginners to advanced
    submitted by /u/Lakshmireddys [link] [comments]  ( 47 min )
  • Open

    WL-Align: Weisfeiler-Lehman Relabeling for Aligning Users across Networks via Regularized Representation Learning. (arXiv:2212.14182v1 [cs.SI])
    Aligning users across networks using graph representation learning has been found effective where the alignment is accomplished in a low-dimensional embedding space. Yet, achieving highly precise alignment is still challenging, especially when nodes with long-range connectivity to the labeled anchors are encountered. To alleviate this limitation, we purposefully designed WL-Align which adopts a regularized representation learning framework to learn distinctive node representations. It extends the Weisfeiler-Lehman Isormorphism Test and learns the alignment in alternating phases of "across-network Weisfeiler-Lehman relabeling" and "proximity-preserving representation learning". The across-network Weisfeiler-Lehman relabeling is achieved through iterating the anchor-based label propagation and a similarity-based hashing to exploit the known anchors' connectivity to different nodes in an efficient and robust manner. The representation learning module preserves the second-order proximity within individual networks and is regularized by the across-network Weisfeiler-Lehman hash labels. Extensive experiments on real-world and synthetic datasets have demonstrated that our proposed WL-Align outperforms the state-of-the-art methods, achieving significant performance improvements in the "exact matching" scenario. Data and code of WL-Align are available at https://github.com/ChenPengGang/WLAlignCode.  ( 2 min )
    Hungry Hungry Hippos: Towards Language Modeling with State Space Models. (arXiv:2212.14052v1 [cs.LG])
    State space models (SSMs) have demonstrated state-of-the-art sequence modeling performance in some modalities, but underperform attention in language modeling. Moreover, despite scaling nearly linearly in sequence length instead of quadratically, SSMs are still slower than Transformers due to poor hardware utilization. In this paper, we make progress on understanding the expressivity gap between SSMs and attention in language modeling, and on reducing the hardware barrier between SSMs and attention. First, we use synthetic language modeling tasks to understand the gap between SSMs and attention. We find that existing SSMs struggle with two capabilities: recalling earlier tokens in the sequence and comparing tokens across the sequence. To understand the impact on language modeling, we propose a new SSM layer, H3, that is explicitly designed for these abilities. H3 matches attention on the synthetic languages and comes within 0.4 PPL of Transformers on OpenWebText. Furthermore, a hybrid 125M-parameter H3-attention model that retains two attention layers surprisingly outperforms Transformers on OpenWebText by 1.0 PPL. Next, to improve the efficiency of training SSMs on modern hardware, we propose FlashConv. FlashConv uses a fused block FFT algorithm to improve efficiency on sequences up to 8K, and introduces a novel state passing algorithm that exploits the recurrent properties of SSMs to scale to longer sequences. FlashConv yields 2$\times$ speedup on the long-range arena benchmark and allows hybrid language models to generate text 1.6$\times$ faster than Transformers. Using FlashConv, we scale hybrid H3-attention language models up to 1.3B parameters on the Pile and find promising initial results, achieving lower perplexity than Transformers and outperforming Transformers in zero- and few-shot learning on a majority of tasks in the SuperGLUE benchmark.  ( 2 min )
    RFold: Towards Simple yet Effective RNA Secondary Structure Prediction. (arXiv:2212.14041v1 [q-bio.BM])
    The secondary structure of ribonucleic acid (RNA) is more stable and accessible in the cell than its tertiary structure, making it essential in functional prediction. Though deep learning has shown promising results in this field, current methods suffer from either the post-processing step with a poor generalization or the pre-processing step with high complexity. In this work, we present RFold, a simple yet effective RNA secondary structure prediction in an end-to-end manner. RFold introduces novel Row-Col Softmax and Row-Col Argmax functions to replace the complicated post-processing step while the output is guaranteed to be valid. Moreover, RFold adopts attention maps as informative representations instead of designing hand-crafted features in the pre-processing step. Extensive experiments demonstrate that RFold achieves competitive performance and about eight times faster inference efficiency than the state-of-the-art method. The code and Colab demo are available in \href{github.com/A4Bio/RFold}{github.com/A4Bio/RFold}.  ( 2 min )
    A Machine Learning Case Study for AI-empowered echocardiography of Intensive Care Unit Patients in low- and middle-income countries. (arXiv:2212.14510v1 [physics.med-ph])
    We present a Machine Learning (ML) study case to illustrate the challenges of clinical translation for a real-time AI-empowered echocardiography system with data of ICU patients in LMICs. Such ML case study includes data preparation, curation and labelling from 2D Ultrasound videos of 31 ICU patients in LMICs and model selection, validation and deployment of three thinner neural networks to classify apical four-chamber view. Results of the ML heuristics showed the promising implementation, validation and application of thinner networks to classify 4CV with limited datasets. We conclude this work mentioning the need for (a) datasets to improve diversity of demographics, diseases, and (b) the need of further investigations of thinner models to be run and implemented in low-cost hardware to be clinically translated in the ICU in LMICs. The code and other resources to reproduce this work are available at https://github.com/vital-ultrasound/ai-assisted-echocardiography-for-low-resource-countries.
    "Real Attackers Don't Compute Gradients": Bridging the Gap Between Adversarial ML Research and Practice. (arXiv:2212.14315v1 [cs.CR])
    Recent years have seen a proliferation of research on adversarial machine learning. Numerous papers demonstrate powerful algorithmic attacks against a wide variety of machine learning (ML) models, and numerous other papers propose defenses that can withstand most attacks. However, abundant real-world evidence suggests that actual attackers use simple tactics to subvert ML-driven systems, and as a result security practitioners have not prioritized adversarial ML defenses. Motivated by the apparent gap between researchers and practitioners, this position paper aims to bridge the two domains. We first present three real-world case studies from which we can glean practical insights unknown or neglected in research. Next we analyze all adversarial ML papers recently published in top security conferences, highlighting positive trends and blind spots. Finally, we state positions on precise and cost-driven threat modeling, collaboration between industry and academia, and reproducible research. We believe that our positions, if adopted, will increase the real-world impact of future endeavours in adversarial ML, bringing both researchers and practitioners closer to their shared goal of improving the security of ML systems.
    An Instrumental Variable Approach to Confounded Off-Policy Evaluation. (arXiv:2212.14468v1 [stat.ML])
    Off-policy evaluation (OPE) is a method for estimating the return of a target policy using some pre-collected observational data generated by a potentially different behavior policy. In some cases, there may be unmeasured variables that can confound the action-reward or action-next-state relationships, rendering many existing OPE approaches ineffective. This paper develops an instrumental variable (IV)-based method for consistent OPE in confounded Markov decision processes (MDPs). Similar to single-stage decision making, we show that IV enables us to correctly identify the target policy's value in infinite horizon settings as well. Furthermore, we propose an efficient and robust value estimator and illustrate its effectiveness through extensive simulations and analysis of real data from a world-leading short-video platform.
    Blind Restoration of Real-World Audio by 1D Operational GANs. (arXiv:2212.14618v1 [cs.SD])
    Objective: Despite numerous studies proposed for audio restoration in the literature, most of them focus on an isolated restoration problem such as denoising or dereverberation, ignoring other artifacts. Moreover, assuming a noisy or reverberant environment with limited number of fixed signal-to-distortion ratio (SDR) levels is a common practice. However, real-world audio is often corrupted by a blend of artifacts such as reverberation, sensor noise, and background audio mixture with varying types, severities, and duration. In this study, we propose a novel approach for blind restoration of real-world audio signals by Operational Generative Adversarial Networks (Op-GANs) with temporal and spectral objective metrics to enhance the quality of restored audio signal regardless of the type and severity of each artifact corrupting it. Methods: 1D Operational-GANs are used with generative neuron model optimized for blind restoration of any corrupted audio signal. Results: The proposed approach has been evaluated extensively over the benchmark TIMIT-RAR (speech) and GTZAN-RAR (non-speech) datasets corrupted with a random blend of artifacts each with a random severity to mimic real-world audio signals. Average SDR improvements of over 7.2 dB and 4.9 dB are achieved, respectively, which are substantial when compared with the baseline methods. Significance: This is a pioneer study in blind audio restoration with the unique capability of direct (time-domain) restoration of real-world audio whilst achieving an unprecedented level of performance for a wide SDR range and artifact types. Conclusion: 1D Op-GANs can achieve robust and computationally effective real-world audio restoration with significantly improved performance. The source codes and the generated real-world audio datasets are shared publicly with the research community in a dedicated GitHub repository1.
    Resolving Task Confusion in Dynamic Expansion Architectures for Class Incremental Learning. (arXiv:2212.14284v1 [cs.CV])
    The dynamic expansion architecture is becoming popular in class incremental learning, mainly due to its advantages in alleviating catastrophic forgetting. However, task confusion is not well assessed within this framework, e.g., the discrepancy between classes of different tasks is not well learned (i.e., inter-task confusion, ITC), and certain priority is still given to the latest class batch (i.e., old-new confusion, ONC). We empirically validate the side effects of the two types of confusion. Meanwhile, a novel solution called Task Correlated Incremental Learning (TCIL) is proposed to encourage discriminative and fair feature utilization across tasks. TCIL performs a multi-level knowledge distillation to propagate knowledge learned from old tasks to the new one. It establishes information flow paths at both feature and logit levels, enabling the learning to be aware of old classes. Besides, attention mechanism and classifier re-scoring are applied to generate more fair classification scores. We conduct extensive experiments on CIFAR100 and ImageNet100 datasets. The results demonstrate that TCIL consistently achieves state-of-the-art accuracy. It mitigates both ITC and ONC, while showing advantages in battle with catastrophic forgetting even no rehearsal memory is reserved.
    Heterogeneous Synthetic Learner for Panel Data. (arXiv:2212.14580v1 [stat.ML])
    In the new era of personalization, learning the heterogeneous treatment effect (HTE) becomes an inevitable trend with numerous applications. Yet, most existing HTE estimation methods focus on independently and identically distributed observations and cannot handle the non-stationarity and temporal dependency in the common panel data setting. The treatment evaluators developed for panel data, on the other hand, typically ignore the individualized information. To fill the gap, in this paper, we initialize the study of HTE estimation in panel data. Under different assumptions for HTE identifiability, we propose the corresponding heterogeneous one-side and two-side synthetic learner, namely H1SL and H2SL, by leveraging the state-of-the-art HTE estimator for non-panel data and generalizing the synthetic control method that allows flexible data generating process. We establish the convergence rates of the proposed estimators. The superior performance of the proposed methods over existing ones is demonstrated by extensive numerical studies.
    ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech. (arXiv:2212.14518v1 [eess.AS])
    Denoising Diffusion Probabilistic Models (DDPMs) are emerging in text-to-speech (TTS) synthesis because of their strong capability of generating high-fidelity samples. However, their iterative refinement process in high-dimensional data space results in slow inference speed, which restricts their application in real-time systems. Previous works have explored speeding up by minimizing the number of inference steps but at the cost of sample quality. In this work, to improve the inference speed for DDPM-based TTS model while achieving high sample quality, we propose ResGrad, a lightweight diffusion model which learns to refine the output spectrogram of an existing TTS model (e.g., FastSpeech 2) by predicting the residual between the model output and the corresponding ground-truth speech. ResGrad has several advantages: 1) Compare with other acceleration methods for DDPM which need to synthesize speech from scratch, ResGrad reduces the complexity of task by changing the generation target from ground-truth mel-spectrogram to the residual, resulting into a more lightweight model and thus a smaller real-time factor. 2) ResGrad is employed in the inference process of the existing TTS model in a plug-and-play way, without re-training this model. We verify ResGrad on the single-speaker dataset LJSpeech and two more challenging datasets with multiple speakers (LibriTTS) and high sampling rate (VCTK). Experimental results show that in comparison with other speed-up methods of DDPMs: 1) ResGrad achieves better sample quality with the same inference speed measured by real-time factor; 2) with similar speech quality, ResGrad synthesizes speech faster than baseline methods by more than 10 times. Audio samples are available at https://resgrad1.github.io/.
    Do Bayesian Variational Autoencoders Know What They Don't Know?. (arXiv:2212.14272v1 [stat.ML])
    The problem of detecting the Out-of-Distribution (OoD) inputs is of paramount importance for Deep Neural Networks. It has been previously shown that even Deep Generative Models that allow estimating the density of the inputs may not be reliable and often tend to make over-confident predictions for OoDs, assigning to them a higher density than to the in-distribution data. This over-confidence in a single model can be potentially mitigated with Bayesian inference over the model parameters that take into account epistemic uncertainty. This paper investigates three approaches to Bayesian inference: stochastic gradient Markov chain Monte Carlo, Bayes by Backpropagation, and Stochastic Weight Averaging-Gaussian. The inference is implemented over the weights of the deep neural networks that parameterize the likelihood of the Variational Autoencoder. We empirically evaluate the approaches against several benchmarks that are often used for OoD detection: estimation of the marginal likelihood utilizing sampled model ensemble, typicality test, disagreement score, and Watanabe-Akaike Information Criterion. Finally, we introduce two simple scores that demonstrate the state-of-the-art performance.
    Wormhole MAML: Meta-Learning in Glued Parameter Space. (arXiv:2212.14094v1 [cs.LG])
    In this paper, we introduce a novel variation of model-agnostic meta-learning, where an extra multiplicative parameter is introduced in the inner-loop adaptation. Our variation creates a shortcut in the parameter space for the inner-loop adaptation and increases model expressivity in a highly controllable manner. We show both theoretically and numerically that our variation alleviates the problem of conflicting gradients and improves training dynamics. We conduct experiments on 3 distinctive problems, including a toy classification problem for threshold comparison, a regression problem for wavelet transform, and a classification problem on MNIST. We also discuss ways to generalize our method to a broader class of problems.
    Gaussian Process Priors for Systems of Linear Partial Differential Equations with Constant Coefficients. (arXiv:2212.14319v1 [stat.ML])
    Partial differential equations (PDEs) are important tools to model physical systems, and including them into machine learning models is an important way of incorporating physical knowledge. Given any system of linear PDEs with constant coefficients, we propose a family of Gaussian process (GP) priors, which we call EPGP, such that all realizations are exact solutions of this system. We apply the Ehrenpreis-Palamodov fundamental principle, which works like a non-linear Fourier transform, to construct GP kernels mirroring standard spectral methods for GPs. Our approach can infer probable solutions of linear PDE systems from any data such as noisy measurements, or initial and boundary conditions. Constructing EPGP-priors is algorithmic, generally applicable, and comes with a sparse version (S-EPGP) that learns the relevant spectral frequencies and works better for big data sets. We demonstrate our approach on three families of systems of PDE, the heat equation, wave equation, and Maxwell's equations, where we improve upon the state of the art in computation time and precision, in some experiments by several orders of magnitude.
    Differentiable Search of Accurate and Robust Architectures. (arXiv:2212.14049v1 [cs.LG])
    Deep neural networks (DNNs) are found to be vulnerable to adversarial attacks, and various methods have been proposed for the defense. Among these methods, adversarial training has been drawing increasing attention because of its simplicity and effectiveness. However, the performance of the adversarial training is greatly limited by the architectures of target DNNs, which often makes the resulting DNNs with poor accuracy and unsatisfactory robustness. To address this problem, we propose DSARA to automatically search for the neural architectures that are accurate and robust after adversarial training. In particular, we design a novel cell-based search space specially for adversarial training, which improves the accuracy and the robustness upper bound of the searched architectures by carefully designing the placement of the cells and the proportional relationship of the filter numbers. Then we propose a two-stage search strategy to search for both accurate and robust neural architectures. At the first stage, the architecture parameters are optimized to minimize the adversarial loss, which makes full use of the effectiveness of the adversarial training in enhancing the robustness. At the second stage, the architecture parameters are optimized to minimize both the natural loss and the adversarial loss utilizing the proposed multi-objective adversarial training method, so that the searched neural architectures are both accurate and robust. We evaluate the proposed algorithm under natural data and various adversarial attacks, which reveals the superiority of the proposed method in terms of both accurate and robust architectures. We also conclude that accurate and robust neural architectures tend to deploy very different structures near the input and the output, which has great practical significance on both hand-crafting and automatically designing of accurate and robust neural architectures.
    Multimodal Explainability via Latent Shift applied to COVID-19 stratification. (arXiv:2212.14084v1 [cs.AI])
    We are witnessing a widespread adoption of artificial intelligence in healthcare. However, most of the advancements in deep learning (DL) in this area consider only unimodal data, neglecting other modalities. Their multimodal interpretation necessary for supporting diagnosis, prognosis and treatment decisions. In this work we present a deep architecture, explainable by design, which jointly learns modality reconstructions and sample classifications using tabular and imaging data. The explanation of the decision taken is computed by applying a latent shift that, simulates a counterfactual prediction revealing the features of each modality that contribute the most to the decision and a quantitative score indicating the modality importance. We validate our approach in the context of COVID-19 pandemic using the AIforCOVID dataset, which contains multimodal data for the early identification of patients at risk of severe outcome. The results show that the proposed method provides meaningful explanations without degrading the classification performance.
    Backward Curriculum Reinforcement Learning. (arXiv:2212.14214v1 [cs.AI])
    The current reinforcement learning algorithm uses forward-generated trajectories to train the agent. The forward-generated trajectories give the agent little guidance, so the agent can explore as much as possible. While the appreciation of reinforcement learning comes from enough exploration, this gives the trade-off of losing sample efficiency. The sampling efficiency is an important factor that decides the performance of the algorithm. Past tasks use reward shaping techniques and changing the structure of the network to increase sample efficiency, however these methods require many steps to implement. In this work, we propose novel reverse curriculum reinforcement learning. Reverse curriculum learning starts training the agent using the backward trajectory of the episode rather than the original forward trajectory. This gives the agent a strong reward signal, so the agent can learn in a more sample-efficient manner. Moreover, our method only requires a minor change in algorithm, which is reversing the order of trajectory before training the agent. Therefore, it can be simply applied to any state-of-art algorithms.
    Measuring and Estimating Key Quality Indicators in Cloud Gaming services. (arXiv:2212.14073v1 [cs.LG])
    User equipment is one of the main bottlenecks facing the gaming industry nowadays. The extremely realistic games which are currently available trigger high computational requirements of the user devices to run games. As a consequence, the game industry has proposed the concept of Cloud Gaming, a paradigm that improves gaming experience in reduced hardware devices. To this end, games are hosted on remote servers, relegating users' devices to play only the role of a peripheral for interacting with the game. However, this paradigm overloads the communication links connecting the users with the cloud. Therefore, service experience becomes highly dependent on network connectivity. To overcome this, Cloud Gaming will be boosted by the promised performance of 5G and future 6G networks, together with the flexibility provided by mobility in multi-RAT scenarios, such as WiFi. In this scope, the present work proposes a framework for measuring and estimating the main E2E metrics of the Cloud Gaming service, namely KQIs. In addition, different machine learning techniques are assessed for predicting KQIs related to Cloud Gaming user's experience. To this end, the main key quality indicators (KQIs) of the service such as input lag, freeze percent or perceived video frame rate are collected in a real environment. Based on these, results show that machine learning techniques provide a good estimation of these indicators solely from network-based metrics. This is considered a valuable asset to guide the delivery of Cloud Gaming services through cellular communications networks even without access to the user's device, as it is expected for telecom operators.
    Choosing the Number of Topics in LDA Models -- A Monte Carlo Comparison of Selection Criteria. (arXiv:2212.14074v1 [cs.CL])
    Selecting the number of topics in LDA models is considered to be a difficult task, for which alternative approaches have been proposed. The performance of the recently developed singular Bayesian information criterion (sBIC) is evaluated and compared to the performance of alternative model selection criteria. The sBIC is a generalization of the standard BIC that can be implemented to singular statistical models. The comparison is based on Monte Carlo simulations and carried out for several alternative settings, varying with respect to the number of topics, the number of documents and the size of documents in the corpora. Performance is measured using different criteria which take into account the correct number of topics, but also whether the relevant topics from the DGPs are identified. Practical recommendations for LDA model selection in applications are derived.
    Can Direct Latent Model Learning Solve Linear Quadratic Gaussian Control?. (arXiv:2212.14511v1 [cs.LG])
    We study the task of learning state representations from potentially high-dimensional observations, with the goal of controlling an unknown partially observable system. We pursue a direct latent model learning approach, where a dynamic model in some latent state space is learned by predicting quantities directly related to planning (e.g., costs) without reconstructing the observations. In particular, we focus on an intuitive cost-driven state representation learning method for solving Linear Quadratic Gaussian (LQG) control, one of the most fundamental partially observable control problems. As our main results, we establish finite-sample guarantees of finding a near-optimal state representation function and a near-optimal controller using the directly learned latent model. To the best of our knowledge, despite various empirical successes, prior to this work it was unclear if such a cost-driven latent model learner enjoys finite-sample guarantees. Our work underscores the value of predicting multi-step costs, an idea that is key to our theory, and notably also an idea that is known to be empirically valuable for learning state representations.
    A Novel Experts Advice Aggregation Framework Using Deep Reinforcement Learning for Portfolio Management. (arXiv:2212.14477v1 [q-fin.CP])
    Solving portfolio management problems using deep reinforcement learning has been getting much attention in finance for a few years. We have proposed a new method using experts signals and historical price data to feed into our reinforcement learning framework. Although experts signals have been used in previous works in the field of finance, as far as we know, it is the first time this method, in tandem with deep RL, is used to solve the financial portfolio management problem. Our proposed framework consists of a convolutional network for aggregating signals, another convolutional network for historical price data, and a vanilla network. We used the Proximal Policy Optimization algorithm as the agent to process the reward and take action in the environment. The results suggested that, on average, our framework could gain 90 percent of the profit earned by the best expert.
    Risk-Sensitive Policy with Distributional Reinforcement Learning. (arXiv:2212.14743v1 [cs.LG])
    Classical reinforcement learning (RL) techniques are generally concerned with the design of decision-making policies driven by the maximisation of the expected outcome. Nevertheless, this approach does not take into consideration the potential risk associated with the actions taken, which may be critical in certain applications. To address that issue, the present research work introduces a novel methodology based on distributional RL to derive sequential decision-making policies that are sensitive to the risk, the latter being modelled by the tail of the return probability distribution. The core idea is to replace the $Q$ function generally standing at the core of learning schemes in RL by another function taking into account both the expected return and the risk. Named the risk-based utility function $U$, it can be extracted from the random return distribution $Z$ naturally learnt by any distributional RL algorithm. This enables to span the complete potential trade-off between risk minimisation and expected return maximisation, in contrast to fully risk-averse methodologies. Fundamentally, this research yields a truly practical and accessible solution for learning risk-sensitive policies with minimal modification to the distributional RL algorithm, and with an emphasis on the interpretability of the resulting decision-making process.
    Multi-step-ahead Stock Price Prediction Using Recurrent Fuzzy Neural Network and Variational Mode Decomposition. (arXiv:2212.14687v1 [q-fin.ST])
    Financial time series prediction, a growing research topic, has attracted considerable interest from scholars, and several approaches have been developed. Among them, decomposition-based methods have achieved promising results. Most decomposition-based methods approximate a single function, which is insufficient for obtaining accurate results. Moreover, most existing researches have concentrated on one-step-ahead forecasting that prevents stock market investors from arriving at the best decisions for the future. This study proposes two novel methods for multi-step-ahead stock price prediction based on the issues outlined. DCT-MFRFNN, a method based on discrete cosine transform (DCT) and multi-functional recurrent fuzzy neural network (MFRFNN), uses DCT to reduce fluctuations in the time series and simplify its structure and MFRFNN to predict the stock price. VMD-MFRFNN, an approach based on variational mode decomposition (VMD) and MFRFNN, brings together their advantages. VMD-MFRFNN consists of two phases. The input signal is decomposed to several IMFs using VMD in the decomposition phase. In the prediction and reconstruction phase, each of the IMFs is given to a separate MFRFNN for prediction, and predicted signals are summed to reconstruct the output. Three financial time series, including Hang Seng Index (HSI), Shanghai Stock Exchange (SSE), and Standard & Poor's 500 Index (SPX), are used for the evaluation of the proposed methods. Experimental results indicate that VMD-MFRFNN surpasses other state-of-the-art methods. VMD-MFRFNN, on average, shows 35.93%, 24.88%, and 34.59% decreases in RMSE from the second-best model for HSI, SSE, and SPX, respectively. Also, DCT-MFRFNN outperforms MFRFNN in all experiments.
    Model-Centric and Data-Centric Aspects of Active Learning for Deep Neural Networks. (arXiv:2009.10835v3 [cs.LG] UPDATED)
    We study different aspects of active learning with deep neural networks in a consistent and unified way. i) We investigate incremental and cumulative training modes which specify how the newly labeled data are used for training. ii) We study active learning w.r.t. the model configurations such as the number of epochs and neurons as well as the choice of batch size. iii) We consider in detail the behavior of query strategies and their corresponding informativeness measures and accordingly propose more efficient querying procedures. iv) We perform statistical analyses, e.g., on actively learned classes and test error estimation, that reveal several insights about active learning. v) We investigate how active learning with neural networks can benefit from pseudo-labels as proxies for actual labels.
    Unsupervised Representation Learning with Minimax Distance Measures. (arXiv:1904.13223v3 [cs.LG] UPDATED)
    We investigate the use of Minimax distances to extract in a nonparametric way the features that capture the unknown underlying patterns and structures in the data. We develop a general-purpose and computationally efficient framework to employ Minimax distances with many machine learning methods that perform on numerical data. We study both computing the pairwise Minimax distances for all pairs of objects and as well as computing the Minimax distances of all the objects to/from a fixed (test) object. We first efficiently compute the pairwise Minimax distances between the objects, using the equivalence of Minimax distances over a graph and over a minimum spanning tree constructed on that. Then, we perform an embedding of the pairwise Minimax distances into a new vector space, such that their squared Euclidean distances in the new space equal to the pairwise Minimax distances in the original space. We also study the case of having multiple pairwise Minimax matrices, instead of a single one. Thereby, we propose an embedding via first summing up the centered matrices and then performing an eigenvalue decomposition to obtain the relevant features. In the following, we study computing Minimax distances from a fixed (test) object which can be used for instance in K-nearest neighbor search. Similar to the case of all-pair pairwise Minimax distances, we develop an efficient and general-purpose algorithm that is applicable with any arbitrary base distance measure. Moreover, we investigate in detail the edges selected by the Minimax distances and thereby explore the ability of Minimax distances in detecting outlier objects. Finally, for each setting, we perform several experiments to demonstrate the effectiveness of our framework.
    Disentangled Explanations of Neural Network Predictions by Finding Relevant Subspaces. (arXiv:2212.14855v1 [cs.LG])
    Explainable AI transforms opaque decision strategies of ML models into explanations that are interpretable by the user, for example, identifying the contribution of each input feature to the prediction at hand. Such explanations, however, entangle the potentially multiple factors that enter into the overall complex decision strategy. We propose to disentangle explanations by finding relevant subspaces in activation space that can be mapped to more abstract human-understandable concepts and enable a joint attribution on concepts and input features. To automatically extract the desired representation, we propose new subspace analysis formulations that extend the principle of PCA and subspace analysis to explanations. These novel analyses, which we call principal relevant component analysis (PRCA) and disentangled relevant subspace analysis (DRSA), optimize relevance of projected activations rather than the more traditional variance or kurtosis. This enables a much stronger focus on subspaces that are truly relevant for the prediction and the explanation, in particular, ignoring activations or concepts to which the prediction model is invariant. Our approach is general enough to work alongside common attribution techniques such as Shapley Value, Integrated Gradients, or LRP. Our proposed methods show to be practically useful and compare favorably to the state of the art as demonstrated on benchmarks and three use cases.
    Invariance to Quantile Selection in Distributional Continuous Control. (arXiv:2212.14262v1 [cs.LG])
    In recent years distributional reinforcement learning has produced many state of the art results. Increasingly sample efficient Distributional algorithms for the discrete action domain have been developed over time that vary primarily in the way they parameterize their approximations of value distributions, and how they quantify the differences between those distributions. In this work we transfer three of the most well-known and successful of those algorithms (QR-DQN, IQN and FQF) to the continuous action domain by extending two powerful actor-critic algorithms (TD3 and SAC) with distributional critics. We investigate whether the relative performance of the methods for the discrete action space translates to the continuous case. To that end we compare them empirically on the pybullet implementations of a set of continuous control tasks. Our results indicate qualitative invariance regarding the number and placement of distributional atoms in the deterministic, continuous action setting.
    HYDRA: Hypergradient Data Relevance Analysis for Interpreting Deep Neural Networks. (arXiv:2102.02515v5 [cs.LG] UPDATED)
    The behaviors of deep neural networks (DNNs) are notoriously resistant to human interpretations. In this paper, we propose Hypergradient Data Relevance Analysis, or HYDRA, which interprets the predictions made by DNNs as effects of their training data. Existing approaches generally estimate data contributions around the final model parameters and ignore how the training data shape the optimization trajectory. By unrolling the hypergradient of test loss w.r.t. the weights of training data, HYDRA assesses the contribution of training data toward test data points throughout the training trajectory. In order to accelerate computation, we remove the Hessian from the calculation and prove that, under moderate conditions, the approximation error is bounded. Corroborating this theoretical claim, empirical results indicate the error is indeed small. In addition, we quantitatively demonstrate that HYDRA outperforms influence functions in accurately estimating data contribution and detecting noisy data labels. The source code is available at https://github.com/cyyever/aaai_hydra_8686.
    Characterization of the Global Bias Problem in Aerial Federated Learning. (arXiv:2212.14360v1 [cs.NI])
    Unmanned aerial vehicles (UAVs) mobility enables flexible and customized federated learning (FL) at the network edge. However, the underlying uncertainties in the aerial-terrestrial wireless channel may lead to a biased FL model. In particular, the distribution of the global model and the aggregation of the local updates within the FL learning rounds at the UAVs are governed by the reliability of the wireless channel. This creates an undesirable bias towards the training data of ground devices with better channel conditions, and vice versa. This paper characterizes the global bias problem of aerial FL in large-scale UAV networks. To this end, the paper proposes a channel-aware distribution and aggregation scheme to enforce equal contribution from all devices in the FL training as a means to resolve the global bias problem. We demonstrate the convergence of the proposed method by experimenting with the MNIST dataset and show its superiority compared to existing methods. The obtained results enable system parameter tuning to relieve the impact of the aerial channel deficiency on the FL convergence rate.
    Easing Automatic Neurorehabilitation via Classification and Smoothness Analysis. (arXiv:2212.14797v1 [eess.SP])
    Assessing the quality of movements for post-stroke patients during the rehabilitation phase is vital given that there is no standard stroke rehabilitation plan for all the patients. In fact, it depends basically on the patient's functional independence and its progress along the rehabilitation sessions. To tackle this challenge and make neurorehabilitation more agile, we propose an automatic assessment pipeline that starts by recognizing patients' movements by means of a shallow deep learning architecture, then measuring the movement quality using jerk measure and related measures. A particularity of this work is that the dataset used is clinically relevant, since it represents movements inspired from Fugl-Meyer a well common upper-limb clinical stroke assessment scale for stroke patients. We show that it is possible to detect the contrast between healthy and patients movements in terms of smoothness, besides achieving conclusions about the patients' progress during the rehabilitation sessions that correspond to the clinicians' findings about each case.
    Black-box language model explanation by context length probing. (arXiv:2212.14815v1 [cs.CL])
    The increasingly widespread adoption of large language models has highlighted the need for improving their explainability. We present context length probing, a novel explanation technique for causal language models, based on tracking the predictions of a model as a function of the length of available context, and allowing to assign differential importance scores to different contexts. The technique is model-agnostic and does not rely on access to model internals beyond computing token-level probabilities. We apply context length probing to large pre-trained language models and offer some initial analyses and insights, including the potential for studying long-range dependencies. The source code and a demo of the method are available.
    Detecting Change Intervals with Isolation Distributional Kernel. (arXiv:2212.14630v1 [cs.LG])
    Detecting abrupt changes in data distribution is one of the most significant tasks in streaming data analysis. Although many unsupervised Change-Point Detection (CPD) methods have been proposed recently to identify those changes, they still suffer from missing subtle changes, poor scalability, or/and sensitive to noise points. To meet these challenges, we are the first to generalise the CPD problem as a special case of the Change-Interval Detection (CID) problem. Then we propose a CID method, named iCID, based on a recent Isolation Distributional Kernel (IDK). iCID identifies the change interval if there is a high dissimilarity score between two non-homogeneous temporal adjacent intervals. The data-dependent property and finite feature map of IDK enabled iCID to efficiently identify various types of change points in data streams with the tolerance of noise points. Moreover, the proposed online and offline versions of iCID have the ability to optimise key parameter settings. The effectiveness and efficiency of iCID have been systematically verified on both synthetic and real-world datasets.
    PCCC: The Pairwise-Confidence-Constraints-Clustering Algorithm. (arXiv:2212.14437v1 [cs.LG])
    We consider a semi-supervised $k$-clustering problem where information is available on whether pairs of objects are in the same or in different clusters. This information is either available with certainty or with a limited level of confidence. We introduce the PCCC algorithm, which iteratively assigns objects to clusters while accounting for the information provided on the pairs of objects. Our algorithm can include relationships as hard constraints that are guaranteed to be satisfied or as soft constraints that can be violated subject to a penalty. This flexibility distinguishes our algorithm from the state-of-the-art in which all pairwise constraints are either considered hard, or all are considered soft. Unlike existing algorithms, our algorithm scales to large-scale instances with up to 60,000 objects, 100 clusters, and millions of cannot-link constraints (which are the most challenging constraints to incorporate). We compare the PCCC algorithm with state-of-the-art approaches in an extensive computational study. Even though the PCCC algorithm is more general than the state-of-the-art approaches in its applicability, it outperforms the state-of-the-art approaches on instances with all hard constraints or all soft constraints both in terms of running time and various metrics of solution quality. The source code of the PCCC algorithm is publicly available on GitHub.
    Learning from Data Streams: An Overview and Update. (arXiv:2212.14720v1 [cs.LG])
    The literature on machine learning in the context of data streams is vast and growing. However, many of the defining assumptions regarding data-stream learning tasks are too strong to hold in practice, or are even contradictory such that they cannot be met in the contexts of supervised learning. Algorithms are chosen and designed based on criteria which are often not clearly stated, for problem settings not clearly defined, tested in unrealistic settings, and/or in isolation from related approaches in the wider literature. This puts into question the potential for real-world impact of many approaches conceived in such contexts, and risks propagating a misguided research focus. We propose to tackle these issues by reformulating the fundamental definitions and settings of supervised data-stream learning with regard to contemporary considerations of concept drift and temporal dependence; and we take a fresh look at what constitutes a supervised data-stream learning task, and a reconsideration of algorithms that may be applied to tackle such tasks. Through and in reflection of this formulation and overview, helped by an informal survey of industrial players dealing with real-world data streams, we provide recommendations. Our main emphasis is that learning from data streams does not impose a single-pass or online-learning approach, or any particular learning regime; and any constraints on memory and time are not specific to streaming. Meanwhile, there exist established techniques for dealing with temporal dependence and concept drift, in other areas of the literature. For the data streams community, we thus encourage a shift in research focus, from dealing with often-artificial constraints and assumptions on the learning mode, to issues such as robustness, privacy, and interpretability which are increasingly relevant to learning in data streams in academic and industrial settings.
    Asynchronous Hybrid Reinforcement Learning for Latency and Reliability Optimization in the Metaverse over Wireless Communications. (arXiv:2212.14749v1 [cs.LG])
    Technology advancements in wireless communications and high-performance Extended Reality (XR) have empowered the developments of the Metaverse. The demand for Metaverse applications and hence, real-time digital twinning of real-world scenes is increasing. Nevertheless, the replication of 2D physical world images into 3D virtual world scenes is computationally intensive and requires computation offloading. The disparity in transmitted scene dimension (2D as opposed to 3D) leads to asymmetric data sizes in uplink (UL) and downlink (DL). To ensure the reliability and low latency of the system, we consider an asynchronous joint UL-DL scenario where in the UL stage, the smaller data size of the physical world scenes captured by multiple extended reality users (XUs) will be uploaded to the Metaverse Console (MC) to be construed and rendered. In the DL stage, the larger-size 3D virtual world scenes need to be transmitted back to the XUs. The decisions pertaining to computation offloading and channel assignment are optimized in the UL stage, and the MC will optimize power allocation for users assigned with a channel in the UL transmission stage. Some problems arise therefrom: (i) interactive multi-process chain, specifically Asynchronous Markov Decision Process (AMDP), (ii) joint optimization in multiple processes, and (iii) high-dimensional objective functions, or hybrid reward scenarios. To ensure the reliability and low latency of the system, we design a novel multi-agent reinforcement learning algorithm structure, namely Asynchronous Actors Hybrid Critic (AAHC). Extensive experiments demonstrate that compared to proposed baselines, AAHC obtains better solutions with preferable training time.
    Langevin algorithms for very deep Neural Networks with application to image classification. (arXiv:2212.14718v1 [cs.LG])
    Training a very deep neural network is a challenging task, as the deeper a neural network is, the more non-linear it is. We compare the performances of various preconditioned Langevin algorithms with their non-Langevin counterparts for the training of neural networks of increasing depth. For shallow neural networks, Langevin algorithms do not lead to any improvement, however the deeper the network is and the greater are the gains provided by Langevin algorithms. Adding noise to the gradient descent allows to escape from local traps, which are more frequent for very deep neural networks. Following this heuristic we introduce a new Langevin algorithm called Layer Langevin, which consists in adding Langevin noise only to the weights associated to the deepest layers. We then prove the benefits of Langevin and Layer Langevin algorithms for the training of popular deep residual architectures for image classification.
    On the Convergence of Discounted Policy Gradient Methods. (arXiv:2212.14066v1 [cs.LG])
    Many popular policy gradient methods for reinforcement learning follow a biased approximation of the policy gradient known as the discounted approximation. While it has been shown that the discounted approximation of the policy gradient is not the gradient of any objective function, little else is known about its convergence behavior or properties. In this paper, we show that if the discounted approximation is followed such that the discount factor is increased slowly at a rate related to a decreasing learning rate, the resulting method recovers the standard guarantees of gradient ascent on the undiscounted objective.
    Conformal Prediction Intervals for Remaining Useful Lifetime Estimation. (arXiv:2212.14612v1 [cs.LG])
    The main objective of Prognostics and Health Management is to estimate the Remaining Useful Lifetime (RUL), namely, the time that a system or a piece of equipment is still in working order before starting to function incorrectly. In recent years, numerous machine learning algorithms have been proposed for RUL estimation, mainly focusing on providing more accurate RUL predictions. However, there are many sources of uncertainty in the problem, such as inherent randomness of systems failure, lack of knowledge regarding their future states, and inaccuracy of the underlying predictive models, making it infeasible to predict the RULs precisely. Hence, it is of utmost importance to quantify the uncertainty alongside the RUL predictions. In this work, we investigate the conformal prediction (CP) framework that represents uncertainty by predicting sets of possible values for the target variable (intervals in the case of RUL) instead of making point predictions. Under very mild technical assumptions, CP formally guarantees that the actual value (true RUL) is covered by the predicted set with a degree of certainty that can be prespecified. We study three CP algorithms to conformalize any single-point RUL predictor and turn it into a valid interval predictor. Finally, we conformalize two single-point RUL predictors, deep convolutional neural networks and gradient boosting, and illustrate their performance on the Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) data sets.
    Deep Temporal Contrastive Clustering. (arXiv:2212.14366v1 [cs.LG])
    Recently the deep learning has shown its advantage in representation learning and clustering for time series data. Despite the considerable progress, the existing deep time series clustering approaches mostly seek to train the deep neural network by some instance reconstruction based or cluster distribution based objective, which, however, lack the ability to exploit the sample-wise (or augmentation-wise) contrastive information or even the higher-level (e.g., cluster-level) contrastiveness for learning discriminative and clustering-friendly representations. In light of this, this paper presents a deep temporal contrastive clustering (DTCC) approach, which for the first time, to our knowledge, incorporates the contrastive learning paradigm into the deep time series clustering research. Specifically, with two parallel views generated from the original time series and their augmentations, we utilize two identical auto-encoders to learn the corresponding representations, and in the meantime perform the cluster distribution learning by incorporating a k-means objective. Further, two levels of contrastive learning are simultaneously enforced to capture the instance-level and cluster-level contrastive information, respectively. With the reconstruction loss of the auto-encoder, the cluster distribution loss, and the two levels of contrastive losses jointly optimized, the network architecture is trained in a self-supervised manner and the clustering result can thereby be obtained. Experiments on a variety of time series datasets demonstrate the superiority of our DTCC approach over the state-of-the-art.
    A Theoretical Framework for AI Models Explainability. (arXiv:2212.14447v1 [cs.AI])
    Explainability is a vibrant research topic in the artificial intelligence community, with growing interest across methods and domains. Much has been written about the topic, yet explainability still lacks shared terminology and a framework capable of providing structural soundness to explanations. In our work, we address these issues by proposing a novel definition of explanation that is a synthesis of what can be found in the literature. We recognize that explanations are not atomic but the product of evidence stemming from the model and its input-output and the human interpretation of this evidence. Furthermore, we fit explanations into the properties of faithfulness (i.e., the explanation being a true description of the model's decision-making) and plausibility (i.e., how much the explanation looks convincing to the user). Using our proposed theoretical framework simplifies how these properties are ope rationalized and provide new insight into common explanation methods that we analyze as case studies.
    FunkNN: Neural Interpolation for Functional Generation. (arXiv:2212.14042v1 [eess.IV])
    Can we build continuous generative models which generalize across scales, can be evaluated at any coordinate, admit calculation of exact derivatives, and are conceptually simple? Existing MLP-based architectures generate worse samples than the grid-based generators with favorable convolutional inductive biases. Models that focus on generating images at different scales do better, but employ complex architectures not designed for continuous evaluation of images and derivatives. We take a signal-processing perspective and treat continuous image generation as interpolation from samples. Indeed, correctly sampled discrete images contain all information about the low spatial frequencies. The question is then how to extrapolate the spectrum in a data-driven way while meeting the above design criteria. Our answer is FunkNN -- a new convolutional network which learns how to reconstruct continuous images at arbitrary coordinates and can be applied to any image dataset. Combined with a discrete generative model it becomes a functional generator which can act as a prior in continuous ill-posed inverse problems. We show that FunkNN generates high-quality continuous images and exhibits strong out-of-distribution performance thanks to its patch-based design. We further showcase its performance in several stylized inverse problems with exact spatial derivatives.
    Towards automating Codenames spymasters with deep reinforcement learning. (arXiv:2212.14104v1 [cs.CL])
    Although most reinforcement learning research has centered on competitive games, little work has been done on applying it to co-operative multiplayer games or text-based games. Codenames is a board game that involves both asymmetric co-operation and natural language processing, which makes it an excellent candidate for advancing RL research. To my knowledge, this work is the first to formulate Codenames as a Markov Decision Process and apply some well-known reinforcement learning algorithms such as SAC, PPO, and A2C to the environment. Although none of the above algorithms converge for the Codenames environment, neither do they converge for a simplified environment called ClickPixel, except when the board size is small.
    Transformer in Transformer as Backbone for Deep Reinforcement Learning. (arXiv:2212.14538v1 [cs.LG])
    Designing better deep networks and better reinforcement learning (RL) algorithms are both important for deep RL. This work focuses on the former. Previous methods build the network with several modules like CNN, LSTM and Attention. Recent methods combine the Transformer with these modules for better performance. However, it requires tedious optimization skills to train a network composed of mixed modules, making these methods inconvenient to be used in practice. In this paper, we propose to design \emph{pure Transformer-based networks} for deep RL, aiming at providing off-the-shelf backbones for both the online and offline settings. Specifically, the Transformer in Transformer (TIT) backbone is proposed, which cascades two Transformers in a very natural way: the inner one is used to process a single observation, while the outer one is responsible for processing the observation history; combining both is expected to extract spatial-temporal representations for good decision-making. Experiments show that TIT can achieve satisfactory performance in different settings, consistently.
    Reservoir kernels and Volterra series. (arXiv:2212.14641v1 [cs.LG])
    A universal kernel is constructed whose sections approximate any causal and time-invariant filter in the fading memory category with inputs and outputs in a finite-dimensional Euclidean space. This kernel is built using the reservoir functional associated with a state-space representation of the Volterra series expansion available for any analytic fading memory filter. It is hence called the Volterra reservoir kernel. Even though the state-space representation and the corresponding reservoir feature map are defined on an infinite-dimensional tensor algebra space, the kernel map is characterized by explicit recursions that are readily computable for specific data sets when employed in estimation problems using the representer theorem. We showcase the performance of the Volterra reservoir kernel in a popular data science application in relation to bitcoin price prediction.
    A Generalist Framework for Panoptic Segmentation of Images and Videos. (arXiv:2210.06366v2 [cs.CV] UPDATED)
    Panoptic segmentation assigns semantic and instance ID labels to every pixel of an image. As permutations of instance IDs are also valid solutions, the task requires learning of high-dimensional one-to-many mapping. As a result, state-of-the-art approaches use customized architectures and task-specific loss functions. We formulate panoptic segmentation as a discrete data generation problem, without relying on inductive bias of the task. A diffusion model based on analog bits is used to model panoptic masks, with a simple, generic architecture and loss function. By simply adding past predictions as a conditioning signal, our method is capable of modeling video (in a streaming setting) and thereby learns to track object instances automatically. With extensive experiments, we demonstrate that our generalist approach can perform competitively to state-of-the-art specialist methods in similar settings.
    Mixture of von Mises-Fisher distribution with sparse prototypes. (arXiv:2212.14591v1 [cs.LG])
    Mixtures of von Mises-Fisher distributions can be used to cluster data on the unit hypersphere. This is particularly adapted for high-dimensional directional data such as texts. We propose in this article to estimate a von Mises mixture using a l 1 penalized likelihood. This leads to sparse prototypes that improve clustering interpretability. We introduce an expectation-maximisation (EM) algorithm for this estimation and explore the trade-off between the sparsity term and the likelihood one with a path following algorithm. The model's behaviour is studied on simulated data and, we show the advantages of the approach on real data benchmark. We also introduce a new data set on financial reports and exhibit the benefits of our method for exploratory analysis.
    Offline Policy Optimization in RL with Variance Regularizaton. (arXiv:2212.14405v1 [cs.LG])
    Learning policies from fixed offline datasets is a key challenge to scale up reinforcement learning (RL) algorithms towards practical applications. This is often because off-policy RL algorithms suffer from distributional shift, due to mismatch between dataset and the target policy, leading to high variance and over-estimation of value functions. In this work, we propose variance regularization for offline RL algorithms, using stationary distribution corrections. We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer. The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms. We show that the regularizer leads to a lower bound to the offline policy optimization objective, which can help avoid over-estimation errors, and explains the benefits of our approach across a range of continuous control domains when compared to existing state-of-the-art algorithms.
    Posterior sampling with CNN-based, Plug-and-Play regularization with applications to Post-Stack Seismic Inversion. (arXiv:2212.14595v1 [stat.ML])
    Uncertainty quantification is crucial to inverse problems, as it could provide decision-makers with valuable information about the inversion results. For example, seismic inversion is a notoriously ill-posed inverse problem due to the band-limited and noisy nature of seismic data. It is therefore of paramount importance to quantify the uncertainties associated to the inversion process to ease the subsequent interpretation and decision making processes. Within this framework of reference, sampling from a target posterior provides a fundamental approach to quantifying the uncertainty in seismic inversion. However, selecting appropriate prior information in a probabilistic inversion is crucial, yet non-trivial, as it influences the ability of a sampling-based inference in providing geological realism in the posterior samples. To overcome such limitations, we present a regularized variational inference framework that performs posterior inference by implicitly regularizing the Kullback-Leibler divergence loss with a CNN-based denoiser by means of the Plug-and-Play methods. We call this new algorithm Plug-and-Play Stein Variational Gradient Descent (PnP-SVGD) and demonstrate its ability in producing high-resolution, trustworthy samples representative of the subsurface structures, which we argue could be used for post-inference tasks such as reservoir modelling and history matching. To validate the proposed method, numerical tests are performed on both synthetic and field post-stack seismic data.
    Bayesian Interpolation with Deep Linear Networks. (arXiv:2212.14457v1 [stat.ML])
    This article concerns Bayesian inference using deep linear networks with output dimension one. In the interpolating (zero noise) regime we show that with Gaussian weight priors and MSE negative log-likelihood loss both the predictive posterior and the Bayesian model evidence can be written in closed form in terms of a class of meromorphic special functions called Meijer-G functions. These results are non-asymptotic and hold for any training dataset, network depth, and hidden layer widths, giving exact solutions to Bayesian interpolation using a deep Gaussian process with a Euclidean covariance at each layer. Through novel asymptotic expansions of Meijer-G functions, a rich new picture of the role of depth emerges. Specifically, we find: ${\bf \text{The role of depth in extrapolation}}$: The posteriors in deep linear networks with data-independent priors are the same as in shallow networks with evidence maximizing data-dependent priors. In this sense, deep linear networks make provably optimal predictions. ${\bf \text{The role of depth in model selection}}$: Starting from data-agnostic priors, Bayesian model evidence in wide networks is only maximized at infinite depth. This gives a principled reason to prefer deeper networks (at least in the linear case). ${\bf \text{Scaling laws relating depth, width, and number of datapoints}}$: With data-agnostic priors, a novel notion of effective depth given by \[\#\text{hidden layers}\times\frac{\#\text{training data}}{\text{network width}}\] determines the Bayesian posterior in wide linear networks, giving rigorous new scaling laws for generalization error.
    On the utility of feature selection in building two-tier decision trees. (arXiv:2212.14448v1 [cs.LG])
    Nowadays, feature selection is frequently used in machine learning when there is a risk of performance degradation due to overfitting or when computational resources are limited. During the feature selection process, the subset of features that are most relevant and least redundant is chosen. In recent years, it has become clear that, in addition to relevance and redundancy, features' complementarity must be considered. Informally, if the features are weak predictors of the target variable separately and strong predictors when combined, then they are complementary. It is demonstrated in this paper that the synergistic effect of complementary features mutually amplifying each other in the construction of two-tier decision trees can be interfered with by another feature, resulting in a decrease in performance. It is demonstrated using cross-validation on both synthetic and real datasets, regression and classification, that removing or eliminating the interfering feature can improve performance by up to 24 times. It has also been discovered that the lesser the domain is learned, the greater the increase in performance. More formally, it is demonstrated that there is a statistically significant negative rank correlation between performance on the dataset prior to the elimination of the interfering feature and performance growth after the elimination of the interfering feature. It is concluded that this broadens the scope of feature selection methods for cases where data and computational resources are sufficient.
    Defense Against Adversarial Attacks on Audio DeepFake Detection. (arXiv:2212.14597v1 [cs.SD])
    Audio DeepFakes are artificially generated utterances created using deep learning methods with the main aim to fool the listeners, most of such audio is highly convincing. Their quality is sufficient to pose a serious threat in terms of security and privacy, such as the reliability of news or defamation. To prevent the threats, multiple neural networks-based methods to detect generated speech have been proposed. In this work, we cover the topic of adversarial attacks, which decrease the performance of detectors by adding superficial (difficult to spot by a human) changes to input data. Our contribution contains evaluating the robustness of 3 detection architectures against adversarial attacks in two scenarios (white-box and using transferability mechanism) and enhancing it later by the use of adversarial training performed by our novel adaptive training method.
    The Voronoigram: Minimax Estimation of Bounded Variation Functions From Scattered Data. (arXiv:2212.14514v1 [stat.ML])
    We consider the problem of estimating a multivariate function $f_0$ of bounded variation (BV), from noisy observations $y_i = f_0(x_i) + z_i$ made at random design points $x_i \in \mathbb{R}^d$, $i=1,\ldots,n$. We study an estimator that forms the Voronoi diagram of the design points, and then solves an optimization problem that regularizes according to a certain discrete notion of total variation (TV): the sum of weighted absolute differences of parameters $\theta_i,\theta_j$ (which estimate the function values $f_0(x_i),f_0(x_j)$) at all neighboring cells $i,j$ in the Voronoi diagram. This is seen to be equivalent to a variational optimization problem that regularizes according to the usual continuum (measure-theoretic) notion of TV, once we restrict the domain to functions that are piecewise constant over the Voronoi diagram. The regression estimator under consideration hence performs (shrunken) local averaging over adaptively formed unions of Voronoi cells, and we refer to it as the Voronoigram, following the ideas in Koenker (2005), and drawing inspiration from Tukey's regressogram (Tukey, 1961). Our contributions in this paper span both the conceptual and theoretical frontiers: we discuss some of the unique properties of the Voronoigram in comparison to TV-regularized estimators that use other graph-based discretizations; we derive the asymptotic limit of the Voronoi TV functional; and we prove that the Voronoigram is minimax rate optimal (up to log factors) for estimating BV functions that are essentially bounded.
    Quantile Off-Policy Evaluation via Deep Conditional Generative Learning. (arXiv:2212.14466v1 [stat.ML])
    Off-Policy evaluation (OPE) is concerned with evaluating a new target policy using offline data generated by a potentially different behavior policy. It is critical in a number of sequential decision making problems ranging from healthcare to technology industries. Most of the work in existing literature is focused on evaluating the mean outcome of a given policy, and ignores the variability of the outcome. However, in a variety of applications, criteria other than the mean may be more sensible. For example, when the reward distribution is skewed and asymmetric, quantile-based metrics are often preferred for their robustness. In this paper, we propose a doubly-robust inference procedure for quantile OPE in sequential decision making and study its asymptotic properties. In particular, we propose utilizing state-of-the-art deep conditional generative learning methods to handle parameter-dependent nuisance function estimation. We demonstrate the advantages of this proposed estimator through both simulations and a real-world dataset from a short-video platform. In particular, we find that our proposed estimator outperforms classical OPE estimators for the mean in settings with heavy-tailed reward distributions.
    Graph Federated Learning for CIoT Devices in Smart Home Applications. (arXiv:2212.14395v1 [cs.LG])
    This paper deals with the problem of statistical and system heterogeneity in a cross-silo Federated Learning (FL) framework where there exist a limited number of Consumer Internet of Things (CIoT) devices in a smart building. We propose a novel Graph Signal Processing (GSP)-inspired aggregation rule based on graph filtering dubbed ``G-Fedfilt''. The proposed aggregator enables a structured flow of information based on the graph's topology. This behavior allows capturing the interconnection of CIoT devices and training domain-specific models. The embedded graph filter is equipped with a tunable parameter which enables a continuous trade-off between domain-agnostic and domain-specific FL. In the case of domain-agnostic, it forces G-Fedfilt to act similar to the conventional Federated Averaging (FedAvg) aggregation rule. The proposed G-Fedfilt also enables an intrinsic smooth clustering based on the graph connectivity without explicitly specified which further boosts the personalization of the models in the framework. In addition, the proposed scheme enjoys a communication-efficient time-scheduling to alleviate the system heterogeneity. This is accomplished by adaptively adjusting the amount of training data samples and sparsity of the models' gradients to reduce communication desynchronization and latency. Simulation results show that the proposed G-Fedfilt achieves up to $3.99\% $ better classification accuracy than the conventional FedAvg when concerning model personalization on the statistically heterogeneous local datasets, while it is capable of yielding up to $2.41\%$ higher accuracy than FedAvg in the case of testing the generalization of the models.
    GPT Takes the Bar Exam. (arXiv:2212.14402v1 [cs.CL])
    Nearly all jurisdictions in the United States require a professional license exam, commonly referred to as "the Bar Exam," as a precondition for law practice. To even sit for the exam, most jurisdictions require that an applicant completes at least seven years of post-secondary education, including three years at an accredited law school. In addition, most test-takers also undergo weeks to months of further, exam-specific preparation. Despite this significant investment of time and capital, approximately one in five test-takers still score under the rate required to pass the exam on their first try. In the face of a complex task that requires such depth of knowledge, what, then, should we expect of the state of the art in "AI?" In this research, we document our experimental evaluation of the performance of OpenAI's `text-davinci-003` model, often-referred to as GPT-3.5, on the multistate multiple choice (MBE) section of the exam. While we find no benefit in fine-tuning over GPT-3.5's zero-shot performance at the scale of our training data, we do find that hyperparameter optimization and prompt engineering positively impacted GPT-3.5's zero-shot performance. For best prompt and parameters, GPT-3.5 achieves a headline correct rate of 50.3% on a complete NCBE MBE practice exam, significantly in excess of the 25% baseline guessing rate, and performs at a passing rate for both Evidence and Torts. GPT-3.5's ranking of responses is also highly-correlated with correctness; its top two and top three choices are correct 71% and 88% of the time, respectively, indicating very strong non-entailment performance. While our ability to interpret these results is limited by nascent scientific understanding of LLMs and the proprietary nature of GPT, we believe that these results strongly suggest that an LLM will pass the MBE component of the Bar Exam in the near future.
    Certifying Safety in Reinforcement Learning under Adversarial Perturbation Attacks. (arXiv:2212.14115v1 [cs.LG])
    Function approximation has enabled remarkable advances in applying reinforcement learning (RL) techniques in environments with high-dimensional inputs, such as images, in an end-to-end fashion, mapping such inputs directly to low-level control. Nevertheless, these have proved vulnerable to small adversarial input perturbations. A number of approaches for improving or certifying robustness of end-to-end RL to adversarial perturbations have emerged as a result, focusing on cumulative reward. However, what is often at stake in adversarial scenarios is the violation of fundamental properties, such as safety, rather than the overall reward that combines safety with efficiency. Moreover, properties such as safety can only be defined with respect to true state, rather than the high-dimensional raw inputs to end-to-end policies. To disentangle nominal efficiency and adversarial safety, we situate RL in deterministic partially-observable Markov decision processes (POMDPs) with the goal of maximizing cumulative reward subject to safety constraints. We then propose a partially-supervised reinforcement learning (PSRL) framework that takes advantage of an additional assumption that the true state of the POMDP is known at training time. We present the first approach for certifying safety of PSRL policies under adversarial input perturbations, and two adversarial training approaches that make direct use of PSRL. Our experiments demonstrate both the efficacy of the proposed approach for certifying safety in adversarial environments, and the value of the PSRL framework coupled with adversarial training in improving certified safety while preserving high nominal reward and high-quality predictions of true state.
    Unsupervised construction of representations for oil wells via Transformers. (arXiv:2212.14246v1 [cs.LG])
    Determining and predicting reservoir formation properties for newly drilled wells represents a significant challenge. One of the variations of these properties evaluation is well-interval similarity. Many methodologies for similarity learning exist: from rule-based approaches to deep neural networks. Recently, articles adopted, e.g. recurrent neural networks to build a similarity model as we deal with sequential data. Such an approach suffers from short-term memory, as it pays more attention to the end of a sequence. Neural network with Transformer architecture instead cast their attention over all sequences to make a decision. To make them more efficient in terms of computational time, we introduce a limited attention mechanism similar to Informer and Performer architectures. We conduct experiments on open datasets with more than 20 wells making our experiments reliable and suitable for industrial usage. The best results were obtained with our adaptation of the Informer variant of Transformer with ROC AUC 0.982. It outperforms classical approaches with ROC AUC 0.824, Recurrent neural networks with ROC AUC 0.934 and straightforward usage of Transformers with ROC AUC 0.961.
    Finding Representative Group Fairness Metrics Using Correlation Estimations. (arXiv:2109.05697v2 [cs.LG] UPDATED)
    It is of critical importance to be aware of the historical discrimination embedded in the data and to consider a fairness measure to reduce bias throughout the predictive modeling pipeline. Given various notions of fairness defined in the literature, investigating the correlation and interaction among metrics is vital for addressing unfairness. Practitioners and data scientists should be able to comprehend each metric and examine their impact on one another given the context, use case, and regulations. Exploring the combinatorial space of different metrics for such examination is burdensome. To alleviate the burden of selecting fairness notions for consideration, we propose a framework that estimates the correlation among fairness notions. Our framework consequently identifies a set of diverse and semantically distinct metrics as representative for a given context. We propose a Monte-Carlo sampling technique for computing the correlations between fairness metrics by indirect and efficient perturbation in the model space. Using the estimated correlations, we then find a subset of representative metrics. The paper proposes a generic method that can be generalized to any arbitrary set of fairness metrics. We showcase the validity of the proposal using comprehensive experiments on real-world benchmark datasets.
    Translating Hanja Historical Documents to Contemporary Korean and English. (arXiv:2205.10019v4 [cs.CL] UPDATED)
    The Annals of Joseon Dynasty (AJD) contain the daily records of the Kings of Joseon, the 500-year kingdom preceding the modern nation of Korea. The Annals were originally written in an archaic Korean writing system, `Hanja', and were translated into Korean from 1968 to 1993. The resulting translation was however too literal and contained many archaic Korean words; thus, a new expert translation effort began in 2012. Since then, the records of only one king have been completed in a decade. In parallel, expert translators are working on English translation, also at a slow pace and produced only one king's records in English so far. Thus, we propose H2KE, a neural machine translation model, that translates historical documents in Hanja to more easily understandable Korean and to English. Built on top of multilingual neural machine translation, H2KE learns to translate a historical document written in Hanja, from both a full dataset of outdated Korean translation and a small dataset of more recently translated contemporary Korean and English. We compare our method against two baselines: a recent model that simultaneously learns to restore and translate Hanja historical document and a Transformer based model trained only on newly translated corpora. The experiments reveal that our method significantly outperforms the baselines in terms of BLEU scores for both contemporary Korean and English translations. We further conduct extensive human evaluation which shows that our translation is preferred over the original expert translations by both experts and non-expert Korean speakers.
    A Simple Approach to Improve Single-Model Deep Uncertainty via Distance-Awareness. (arXiv:2205.00403v2 [cs.LG] UPDATED)
    Accurate uncertainty quantification is a major challenge in deep learning, as neural networks can make overconfident errors and assign high confidence predictions to out-of-distribution (OOD) inputs. The most popular approaches to estimate predictive uncertainty in deep learning are methods that combine predictions from multiple neural networks, such as Bayesian neural networks (BNNs) and deep ensembles. However their practicality in real-time, industrial-scale applications are limited due to the high memory and computational cost. Furthermore, ensembles and BNNs do not necessarily fix all the issues with the underlying member networks. In this work, we study principled approaches to improve uncertainty property of a single network, based on a single, deterministic representation. By formalizing the uncertainty quantification as a minimax learning problem, we first identify distance awareness, i.e., the model's ability to quantify the distance of a testing example from the training data, as a necessary condition for a DNN to achieve high-quality (i.e., minimax optimal) uncertainty estimation. We then propose Spectral-normalized Neural Gaussian Process (SNGP), a simple method that improves the distance-awareness ability of modern DNNs with two simple changes: (1) applying spectral normalization to hidden weights to enforce bi-Lipschitz smoothness in representations and (2) replacing the last output layer with a Gaussian process layer. On a suite of vision and language understanding benchmarks, SNGP outperforms other single-model approaches in prediction, calibration and out-of-domain detection. Furthermore, SNGP provides complementary benefits to popular techniques such as deep ensembles and data augmentation, making it a simple and scalable building block for probabilistic deep learning. Code is open-sourced at https://github.com/google/uncertainty-baselines
    Self-Attentive Pooling for Efficient Deep Learning. (arXiv:2209.07659v3 [cs.CV] UPDATED)
    Efficient custom pooling techniques that can aggressively trim the dimensions of a feature map and thereby reduce inference compute and memory footprint for resource-constrained computer vision applications have recently gained significant traction. However, prior pooling works extract only the local context of the activation maps, limiting their effectiveness. In contrast, we propose a novel non-local self-attentive pooling method that can be used as a drop-in replacement to the standard pooling layers, such as max/average pooling or strided convolution. The proposed self-attention module uses patch embedding, multi-head self-attention, and spatial-channel restoration, followed by sigmoid activation and exponential soft-max. This self-attention mechanism efficiently aggregates dependencies between non-local activation patches during down-sampling. Extensive experiments on standard object classification and detection tasks with various convolutional neural network (CNN) architectures demonstrate the superiority of our proposed mechanism over the state-of-the-art (SOTA) pooling techniques. In particular, we surpass the test accuracy of existing pooling techniques on different variants of MobileNet-V2 on ImageNet by an average of 1.2%. With the aggressive down-sampling of the activation maps in the initial layers (providing up to 22x reduction in memory consumption), our approach achieves 1.43% higher test accuracy compared to SOTA techniques with iso-memory footprints. This enables the deployment of our models in memory-constrained devices, such as micro-controllers (without losing significant accuracy), because the initial activation maps consume a significant amount of on-chip memory for high-resolution images required for complex vision tasks. Our proposed pooling method also leverages the idea of channel pruning to further reduce memory footprints.
    Out-Of-Distribution Generalization on Graphs: A Survey. (arXiv:2202.07987v2 [cs.LG] UPDATED)
    Graph machine learning has been extensively studied in both academia and industry. Although booming with a vast number of emerging methods and techniques, most of the literature is built on the in-distribution hypothesis, i.e., testing and training graph data are identically distributed. However, this in-distribution hypothesis can hardly be satisfied in many real-world graph scenarios where the model performance substantially degrades when there exist distribution shifts between testing and training graph data. To solve this critical problem, out-of-distribution (OOD) generalization on graphs, which goes beyond the in-distribution hypothesis, has made great progress and attracted ever-increasing attention from the research community. In this paper, we comprehensively survey OOD generalization on graphs and present a detailed review of recent advances in this area. First, we provide a formal problem definition of OOD generalization on graphs. Second, we categorize existing methods into three classes from conceptually different perspectives, i.e., data, model, and learning strategy, based on their positions in the graph machine learning pipeline, followed by detailed discussions for each category. We also review the theories related to OOD generalization on graphs and introduce the commonly used graph datasets for thorough evaluations. Finally, we share our insights on future research directions. This paper is the first systematic and comprehensive review of OOD generalization on graphs, to the best of our knowledge.
    NNSmith: Generating Diverse and Valid Test Cases for Deep Learning Compilers. (arXiv:2207.13066v2 [cs.LG] UPDATED)
    Deep-learning (DL) compilers such as TVM and TensorRT are increasingly being used to optimize deep neural network (DNN) models to meet performance, resource utilization and other requirements. Bugs in these compilers can result in models whose semantics differ from the original ones, producing incorrect results that corrupt the correctness of downstream applications. However, finding bugs in these compilers is challenging due to their complexity. In this work, we propose a new fuzz testing approach for finding bugs in deep-learning compilers. Our core approach consists of (i) generating diverse yet valid DNN test models that can exercise a large part of the compiler's transformation logic using light-weight operator specifications; (ii) performing gradient-based search to find model inputs that avoid any floating-point exceptional values during model execution, reducing the chance of missed bugs or false alarms; and (iii) using differential testing to identify bugs. We implemented this approach in NNSmith which has found 72 new bugs for TVM, TensorRT, ONNXRuntime, and PyTorch to date. Of these 58 have been confirmed and 51 have been fixed by their respective project maintainers.
    Batchless Normalization: How to Normalize Activations with just one Instance in Memory. (arXiv:2212.14729v1 [cs.LG])
    In training neural networks, batch normalization has many benefits, not all of them entirely understood. But it also has some drawbacks. Foremost is arguably memory consumption, as computing the batch statistics requires all instances within the batch to be processed simultaneously, whereas without batch normalization it would be possible to process them one by one while accumulating the weight gradients. Another drawback is that that distribution parameters (mean and standard deviation) are unlike all other model parameters in that they are not trained using gradient descent but require special treatment, complicating implementation. In this paper, I show a simple and straightforward way to address these issues. The idea, in short, is to add terms to the loss that, for each activation, cause the minimization of the negative log likelihood of a Gaussian distribution that is used to normalize the activation. Among other benefits, this will hopefully contribute to the democratization of AI research by means of lowering the hardware requirements for training larger models.
    Discovering Useful Compact Sets of Sequential Rules in a Long Sequence. (arXiv:2109.07519v2 [cs.LG] UPDATED)
    We are interested in understanding the underlying generation process for long sequences of symbolic events. To do so, we propose COSSU, an algorithm to mine small and meaningful sets of sequential rules. The rules are selected using an MDL-inspired criterion that favors compactness and relies on a novel rule-based encoding scheme for sequences. Our evaluation shows that COSSU can successfully retrieve relevant sets of closed sequential rules from a long sequence. Such rules constitute an interpretable model that exhibits competitive accuracy for the tasks of next-element prediction and classification.
    Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets. (arXiv:2202.07511v3 [cs.LG] UPDATED)
    We study episodic two-player zero-sum Markov games (MGs) in the offline setting, where the goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori. When the dataset does not have uniform coverage over all policy pairs, finding an approximate NE involves challenges in three aspects: (i) distributional shift between the behavior policy and the optimal policy, (ii) function approximation to handle large state space, and (iii) minimax optimization for equilibrium solving. We propose a pessimism-based algorithm, dubbed as pessimistic minimax value iteration (PMVI), which overcomes the distributional shift by constructing pessimistic estimates of the value functions for both players and outputs a policy pair by solving NEs based on the two value functions. Furthermore, we establish a data-dependent upper bound on the suboptimality which recovers a sublinear rate without the assumption on uniform coverage of the dataset. We also prove an information-theoretical lower bound, which suggests that the data-dependent term in the upper bound is intrinsic. Our theoretical results also highlight a notion of "relative uncertainty", which characterizes the necessary and sufficient condition for achieving sample efficiency in offline MGs. To the best of our knowledge, we provide the first nearly minimax optimal result for offline MGs with function approximation.
    CLNode: Curriculum Learning for Node Classification. (arXiv:2206.07258v2 [cs.LG] UPDATED)
    Node classification is a fundamental graph-based task that aims to predict the classes of unlabeled nodes, for which Graph Neural Networks (GNNs) are the state-of-the-art methods. Current GNNs assume that nodes in the training set contribute equally during training. However, the quality of training nodes varies greatly, and the performance of GNNs could be harmed by two types of low-quality training nodes: (1) inter-class nodes situated near class boundaries that lack the typical characteristics of their corresponding classes. Because GNNs are data-driven approaches, training on these nodes could degrade the accuracy. (2) mislabeled nodes. In real-world graphs, nodes are often mislabeled, which can significantly degrade the robustness of GNNs. To mitigate the detrimental effect of the low-quality training nodes, we present CLNode, which employs a selective training strategy to train GNN based on the quality of nodes. Specifically, we first design a multi-perspective difficulty measurer to accurately measure the quality of training nodes. Then, based on the measured qualities, we employ a training scheduler that selects appropriate training nodes to train GNN in each epoch. To evaluate the effectiveness of CLNode, we conduct extensive experiments by incorporating it in six representative backbone GNNs. Experimental results on real-world networks demonstrate that CLNode is a general framework that can be combined with various GNNs to improve their accuracy and robustness.
    Algorithmic decision making methods for fair credit scoring. (arXiv:2209.07912v2 [cs.LG] UPDATED)
    The utility of machine learning in evaluating the creditworthiness of loan applicants has been proofed since decades ago. However, automatic decisions may lead to different treatments over groups or individuals, potentially causing discrimination. This paper benchmarks 12 top bias mitigation methods discussing their performance based on 5 different fairness metrics, accuracy achieved, and potential profits for the financial institutions. Our findings show the difficulties in achieving fairness while preserving accuracy and profits. Additionally, it highlights some of the best and worst performers and helps bridging the gap between experimental machine learning and its industrial application.
    Reducing Certified Regression to Certified Classification for General Poisoning Attacks. (arXiv:2208.13904v2 [cs.LG] UPDATED)
    Adversarial training instances can severely distort a model's behavior. This work investigates certified regression defenses, which provide guaranteed limits on how much a regressor's prediction may change under a poisoning attack. Our key insight is that certified regression reduces to voting-based certified classification when using median as a model's primary decision function. Coupling our reduction with existing certified classifiers, we propose six new regressors provably-robust to poisoning attacks. To the extent of our knowledge, this is the first work that certifies the robustness of individual regression predictions without any assumptions about the data distribution and model architecture. We also show that the assumptions made by existing state-of-the-art certified classifiers are often overly pessimistic. We introduce a tighter analysis of model robustness, which in many cases results in significantly improved certified guarantees. Lastly, we empirically demonstrate our approaches' effectiveness on both regression and classification data, where the accuracy of up to 50% of test predictions can be guaranteed under 1% training set corruption and up to 30% of predictions under 4% corruption. Our source code is available at https://github.com/ZaydH/certified-regression.
    Modeling the Data-Generating Process is Necessary for Out-of-Distribution Generalization. (arXiv:2206.07837v3 [cs.LG] UPDATED)
    Recent empirical studies on domain generalization (DG) have shown that DG algorithms that perform well on some distribution shifts fail on others, and no state-of-the-art DG algorithm performs consistently well on all shifts. Moreover, real-world data often has multiple distribution shifts over different attributes; hence we introduce multi-attribute distribution shift datasets and find that the accuracy of existing DG algorithms falls even further. To explain these results, we provide a formal characterization of generalization under multi-attribute shifts using a canonical causal graph. Based on the relationship between spurious attributes and the classification label, we obtain realizations of the canonical causal graph that characterize common distribution shifts and show that each shift entails different independence constraints over observed variables. As a result, we prove that any algorithm based on a single, fixed constraint cannot work well across all shifts, providing theoretical evidence for mixed empirical results on DG algorithms. Based on this insight, we develop Causally Adaptive Constraint Minimization (CACM), an algorithm that uses knowledge about the data-generating process to adaptively identify and apply the correct independence constraints for regularization. Results on fully synthetic, MNIST, small NORB, and Waterbirds datasets, covering binary and multi-valued attributes and labels, show that adaptive dataset-dependent constraints lead to the highest accuracy on unseen domains whereas incorrect constraints fail to do so. Our results demonstrate the importance of modeling the causal relationships inherent in the data-generating process.
    Semantic Communications with Discrete-time Analog Transmission: A PAPR Perspective. (arXiv:2208.08342v3 [cs.IT] UPDATED)
    Recent progress in deep learning (DL)-based joint source-channel coding (DeepJSCC) has led to a new paradigm of semantic communications. Two salient features of DeepJSCC-based semantic communications are the exploitation of semantic-aware features directly from the source signal, and the discrete-time analog transmission (DTAT) of these features. Compared with traditional digital communications, semantic communications with DeepJSCC provide superior reconstruction performance at the receiver and graceful degradation with diminishing channel quality, but also exhibit a large peak-to-average power ratio (PAPR) in the transmitted signal. An open question has been whether the gains of DeepJSCC come from the additional freedom brought by the high-PAPR continuous-amplitude signal. In this paper, we address this question by exploring three PAPR reduction techniques in the application of image transmission. We confirm that the superior image reconstruction performance of DeepJSCC-based semantic communications can be retained while the transmitted PAPR is suppressed to an acceptable level. This observation is an important step towards the implementation of DeepJSCC in practical semantic communication systems.
    Git Re-Basin: Merging Models modulo Permutation Symmetries. (arXiv:2209.04836v5 [cs.LG] UPDATED)
    The success of deep learning is due in large part to our ability to solve certain massive non-convex optimization problems with relative ease. Though non-convex optimization is NP-hard, simple algorithms -- often variants of stochastic gradient descent -- exhibit surprising effectiveness in fitting large neural networks in practice. We argue that neural network loss landscapes contain (nearly) a single basin after accounting for all possible permutation symmetries of hidden units a la Entezari et al. (2021). We introduce three algorithms to permute the units of one model to bring them into alignment with a reference model in order to merge the two models in weight space. This transformation produces a functionally equivalent set of weights that lie in an approximately convex basin near the reference model. Experimentally, we demonstrate the single basin phenomenon across a variety of model architectures and datasets, including the first (to our knowledge) demonstration of zero-barrier linear mode connectivity between independently trained ResNet models on CIFAR-10 and CIFAR-100. Additionally, we investigate intriguing phenomena relating model width and training time to mode connectivity. Finally, we discuss shortcomings of the linear mode connectivity hypothesis, including a counterexample to the single basin theory.
    Variationally Mimetic Operator Networks. (arXiv:2209.12871v2 [math.NA] UPDATED)
    In recent years operator networks have emerged as promising deep learning tools for approximating the solution to partial differential equations (PDEs). These networks map input functions that describe material properties, forcing functions and boundary data to the solution of a PDE. This work describes a new architecture for operator networks that mimics the form of the numerical solution obtained from an approximate variational or weak formulation of the problem. The application of these ideas to a generic elliptic PDE leads to a variationally mimetic operator network (VarMiON). Like the conventional Deep Operator Network (DeepONet) the VarMiON is also composed of a sub-network that constructs the basis functions for the output and another that constructs the coefficients for these basis functions. However, in contrast to the DeepONet, the architecture of these sub-networks in the VarMiON is precisely determined. An analysis of the error in the VarMiON solution reveals that it contains contributions from the error in the training data, the training error, the quadrature error in sampling input and output functions, and a "covering error" that measures the distance between the test input functions and the nearest functions in the training dataset. It also depends on the stability constants for the exact solution operator and its VarMiON approximation. The application of the VarMiON to a canonical elliptic PDE reveals that for approximately the same number of network parameters, on average the VarMiON incurs smaller errors than a standard DeepONet. Further, its performance is more robust to variations in input functions, the techniques used to sample the input and output functions, the techniques used to construct the basis functions, and the number of input functions.
    A Survey on Training Challenges in Generative Adversarial Networks for Biomedical Image Analysis. (arXiv:2201.07646v2 [cs.LG] UPDATED)
    In biomedical image analysis, the applicability of deep learning methods is directly impacted by the quantity of image data available. This is due to deep learning models requiring large image datasets to provide high-level performance. Generative Adversarial Networks (GANs) have been widely utilized to address data limitations through the generation of synthetic biomedical images. GANs consist of two models. The generator, a model that learns how to produce synthetic images based on the feedback it receives. The discriminator, a model that classifies an image as synthetic or real and provides feedback to the generator. Throughout the training process, a GAN can experience several technical challenges that impede the generation of suitable synthetic imagery. First, the mode collapse problem whereby the generator either produces an identical image or produces a uniform image from distinct input features. Second, the non-convergence problem whereby the gradient descent optimizer fails to reach a Nash equilibrium. Thirdly, the vanishing gradient problem whereby unstable training behavior occurs due to the discriminator achieving optimal classification performance resulting in no meaningful feedback being provided to the generator. These problems result in the production of synthetic imagery that is blurry, unrealistic, and less diverse. To date, there has been no survey article outlining the impact of these technical challenges in the context of the biomedical imagery domain. This work presents a review and taxonomy based on solutions to the training problems of GANs in the biomedical imaging domain. This survey highlights important challenges and outlines future research directions about the training of GANs in the domain of biomedical imagery.
    NISQ-ready community detection based on separation-node identification. (arXiv:2212.14717v1 [quant-ph])
    The analysis of network structure is essential to many scientific areas, ranging from biology to sociology. As the computational task of clustering these networks into partitions, i.e., solving the community detection problem, is generally NP-hard, heuristic solutions are indispensable. The exploration of expedient heuristics has led to the development of particularly promising approaches in the emerging technology of quantum computing. Motivated by the substantial hardware demands for all established quantum community detection approaches, we introduce a novel QUBO based approach that only needs number-of-nodes many qubits and is represented by a QUBO-matrix as sparse as the input graph's adjacency matrix. The substantial improvement on the sparsity of the QUBO-matrix, which is typically very dense in related work, is achieved through the novel concept of separation-nodes. Instead of assigning every node to a community directly, this approach relies on the identification of a separation-node set, which -- upon its removal from the graph -- yields a set of connected components, representing the core components of the communities. Employing a greedy heuristic to assign the nodes from the separation-node sets to the identified community cores, subsequent experimental results yield a proof of concept. This work hence displays a promising approach to NISQ ready quantum community detection, catalyzing the application of quantum computers for the network structure analysis of large scale, real world problem instances.
    Decoupled Self-supervised Learning for Graphs. (arXiv:2206.03601v2 [cs.LG] UPDATED)
    This paper studies the problem of conducting self-supervised learning for node representation learning on graphs. Most existing self-supervised learning methods assume the graph is homophilous, where linked nodes often belong to the same class or have similar features. However, such assumptions of homophily do not always hold in real-world graphs. We address this problem by developing a decoupled self-supervised learning (DSSL) framework for graph neural networks. DSSL imitates a generative process of nodes and links from latent variable modeling of the semantic structure, which decouples different underlying semantics between different neighborhoods into the self-supervised learning process. Our DSSL framework is agnostic to the encoders and does not need prefabricated augmentations, thus is flexible to different graphs. To effectively optimize the framework, we derive the evidence lower bound of the self-supervised objective and develop a scalable training algorithm with variational inference. We provide a theoretical analysis to justify that DSSL enjoys the better downstream performance. Extensive experiments on various types of graph benchmarks demonstrate that our proposed framework can achieve better performance compared with competitive baselines.
    Learning Representations from Dendrograms. (arXiv:1812.09225v4 [cs.LG] UPDATED)
    We propose unsupervised representation learning and feature extraction from dendrograms. The commonly used Minimax distance measures correspond to building a dendrogram with single linkage criterion, with defining specific forms of a level function and a distance function over that. Therefore, we extend this method to arbitrary dendrograms. We develop a generalized framework wherein different distance measures and representations can be inferred from different types of dendrograms, level functions and distance functions. Via an appropriate embedding, we compute a vector-based representation of the inferred distances, in order to enable many numerical machine learning algorithms to employ such distances. Then, to address the model selection problem, we study the aggregation of different dendrogram-based distances respectively in solution space and in representation space in the spirit of deep representations. In the first approach, for example for the clustering problem, we build a graph with positive and negative edge weights according to the consistency of the clustering labels of different objects among different solutions, in the context of ensemble methods. Then, we use an efficient variant of correlation clustering to produce the final clusters. In the second approach, we investigate the combination of different distances and features sequentially in the spirit of multi-layered architectures to obtain the final features. Finally, we demonstrate the effectiveness of our approach via several numerical studies.
    The Stable Artist: Steering Semantics in Diffusion Latent Space. (arXiv:2212.06013v2 [cs.CV] UPDATED)
    Large, text-conditioned generative diffusion models have recently gained a lot of attention for their impressive performance in generating high-fidelity images from text alone. However, achieving high-quality results is almost unfeasible in a one-shot fashion. On the contrary, text-guided image generation involves the user making many slight changes to inputs in order to iteratively carve out the envisioned image. However, slight changes to the input prompt often lead to entirely different images being generated, and thus the control of the artist is limited in its granularity. To provide flexibility, we present the Stable Artist, an image editing approach enabling fine-grained control of the image generation process. The main component is semantic guidance (SEGA) which steers the diffusion process along variable numbers of semantic directions. This allows for subtle edits to images, changes in composition and style, as well as optimization of the overall artistic conception. Furthermore, SEGA enables probing of latent spaces to gain insights into the representation of concepts learned by the model, even complex ones such as 'carbon emission'. We demonstrate the Stable Artist on several tasks, showcasing high-quality image editing and composition.
    Optimal Best Arm Identification in Two-Armed Bandits with a Fixed Budget under a Small Gap. (arXiv:2201.04469v8 [stat.ML] UPDATED)
    We consider fixed-budget best-arm identification in two-armed Gaussian bandit problems. One of the longstanding open questions is the existence of an optimal strategy under which the probability of misidentification matches a lower bound. We show that a strategy following the Neyman allocation rule (Neyman, 1934) is asymptotically optimal when the gap between the expected rewards is small. First, we review a lower bound derived by Kaufmann et al. (2016). Then, we propose the "Neyman Allocation (NA)-Augmented Inverse Probability weighting (AIPW)" strategy, which consists of the sampling rule using the Neyman allocation with an estimated standard deviation and the recommendation rule using an AIPW estimator. Our proposed strategy is optimal because the upper bound matches the lower bound when the budget goes to infinity and the gap goes to zero.
    Active Learning Through a Covering Lens. (arXiv:2205.11320v3 [cs.LG] UPDATED)
    Deep active learning aims to reduce the annotation cost for the training of deep models, which is notoriously data-hungry. Until recently, deep active learning methods were ineffectual in the low-budget regime, where only a small number of examples are annotated. The situation has been alleviated by recent advances in representation and self-supervised learning, which impart the geometry of the data representation with rich information about the points. Taking advantage of this progress, we study the problem of subset selection for annotation through a "covering" lens, proposing ProbCover - a new active learning algorithm for the low budget regime, which seeks to maximize Probability Coverage. We then describe a dual way to view the proposed formulation, from which one can derive strategies suitable for the high budget regime of active learning, related to existing methods like Coreset. We conclude with extensive experiments, evaluating ProbCover in the low-budget regime. We show that our principled active learning strategy improves the state-of-the-art in the low-budget regime in several image recognition benchmarks. This method is especially beneficial in the semi-supervised setting, allowing state-of-the-art semi-supervised methods to match the performance of fully supervised methods, while using much fewer labels nonetheless. Code is available at https://github.com/avihu111/TypiClust.
    Shift of Pairwise Similarities for Data Clustering. (arXiv:2110.13103v2 [cs.LG] UPDATED)
    Several clustering methods (e.g., Normalized Cut and Ratio Cut) divide the Min Cut cost function by a cluster dependent factor (e.g., the size or the degree of the clusters), in order to yield a more balanced partitioning. We, instead, investigate adding such regularizations to the original cost function. We first consider the case where the regularization term is the sum of the squared size of the clusters, and then generalize it to adaptive regularization of the pairwise similarities. This leads to shifting (adaptively) the pairwise similarities which might make some of them negative. We then study the connection of this method to Correlation Clustering and then propose an efficient local search optimization algorithm with fast theoretical convergence rate to solve the new clustering problem. In the following, we investigate the shift of pairwise similarities on some common clustering methods, and finally, we demonstrate the superior performance of the method by extensive experiments on different datasets.
    Spatio-Temporal Wind Speed Forecasting using Graph Networks and Novel Transformer Architectures. (arXiv:2208.13585v2 [cs.LG] UPDATED)
    This study focuses on multi-step spatio-temporal wind speed forecasting for the Norwegian continental shelf. The study aims to leverage spatial dependencies through the relative physical location of different measurement stations to improve local wind forecasts. Our multi-step forecasting models produce either 10-minute, 1- or 4-hour forecasts, with 10-minute resolution, meaning that the models produce more informative time series for predicted future trends. A graph neural network (GNN) architecture was used to extract spatial dependencies, with different update functions to learn temporal correlations. These update functions were implemented using different neural network architectures. One such architecture, the Transformer, has become increasingly popular for sequence modelling in recent years. Various alterations have been proposed to better facilitate time series forecasting, of which this study focused on the Informer, LogSparse Transformer and Autoformer. This is the first time the LogSparse Transformer and Autoformer have been applied to wind forecasting and the first time any of these or the Informer have been formulated in a spatio-temporal setting for wind forecasting. By comparing against spatio-temporal Long Short-Term Memory (LSTM) and Multi-Layer Perceptron (MLP) models, the study showed that the models using the altered Transformer architectures as update functions in GNNs were able to outperform these. Furthermore, we propose the Fast Fourier Transformer (FFTransformer), which is a novel Transformer architecture based on signal decomposition and consists of two separate streams that analyse the trend and periodic components separately. The FFTransformer and Autoformer were found to achieve superior results for the 10-minute and 1-hour ahead forecasts, with the FFTransformer significantly outperforming all other models for the 4-hour ahead forecasts.
    Joint Non-parametric Point Process model for Treatments and Outcomes: Counterfactual Time-series Prediction Under Policy Interventions. (arXiv:2209.04142v3 [cs.LG] UPDATED)
    Policy makers need to predict the progression of an outcome before adopting a new treatment policy, which defines when and how a sequence of treatments affecting the outcome occurs in continuous time. Commonly, algorithms that predict interventional future outcome trajectories take a fixed sequence of future treatments as input. This either neglects the dependence of future treatments on outcomes preceding them or implicitly assumes the treatment policy is known, and hence excludes scenarios where the policy is unknown or a counterfactual analysis is needed. To handle these limitations, we develop a joint model for treatments and outcomes, which allows for the estimation of treatment policies and effects from sequential treatment--outcome data. It can answer interventional and counterfactual queries about interventions on treatment policies, as we show with real-world data on blood glucose progression and a simulation study building on top of this.
    SmartGD: A GAN-Based Graph Drawing Framework for Diverse Aesthetic Goals. (arXiv:2206.06434v2 [cs.LG] UPDATED)
    A multitude of studies have been conducted on graph drawing, but many existing methods only focus on optimizing a single aesthetic aspect of graph layouts. There are a few existing methods that attempt to develop a flexible solution for optimizing different aesthetic aspects measured by different aesthetic criteria. Furthermore, thanks to the significant advance in deep learning techniques, several deep learning-based layout methods were proposed recently, which have demonstrated the advantages of the deep learning approaches for graph drawing. However, none of these existing methods can be directly applied to optimizing non-differentiable criteria without special accommodation. In this work, we propose a novel Generative Adversarial Network (GAN) based deep learning framework for graph drawing, called SmartGD, which can optimize any quantitative aesthetic goals even though they are non-differentiable. In the cases where the aesthetic goal is too abstract to be described mathematically, SmartGD can draw graphs in a similar style as a collection of good layout examples, which might be selected by humans based on the abstract aesthetic goal. To demonstrate the effectiveness and efficiency of SmartGD, we conduct experiments on minimizing stress, minimizing edge crossing, maximizing crossing angle, and a combination of multiple aesthetics. Compared with several popular graph drawing algorithms, the experimental results show that SmartGD achieves good performance both quantitatively and qualitatively.
    Experimental verification of the quantum nature of a neural network. (arXiv:2209.07577v2 [cs.NE] UPDATED)
    In my previous article I mentioned for the first time that a classical neural network may have quantum properties as its own structure may be entangled. The question one may ask now is whether such a quantum property can be used to entangle other systems? The answer should be yes, as shown in what follows.
    GStarX: Explaining Graph Neural Networks with Structure-Aware Cooperative Games. (arXiv:2201.12380v5 [cs.LG] UPDATED)
    Explaining machine learning models is an important and increasingly popular area of research interest. The Shapley value from game theory has been proposed as a prime approach to compute feature importance towards model predictions on images, text, tabular data, and recently graph neural networks (GNNs) on graphs. In this work, we revisit the appropriateness of the Shapley value for GNN explanation, where the task is to identify the most important subgraph and constituent nodes for GNN predictions. We claim that the Shapley value is a non-ideal choice for graph data because it is by definition not structure-aware. We propose a Graph Structure-aware eXplanation (GStarX) method to leverage the critical graph structure information to improve the explanation. Specifically, we define a scoring function based on a new structure-aware value from the cooperative game theory proposed by Hamiache and Navarro (HN). When used to score node importance, the HN value utilizes graph structures to attribute cooperation surplus between neighbor nodes, resembling message passing in GNNs, so that node importance scores reflect not only the node feature importance, but also the node structural roles. We demonstrate that GStarX produces qualitatively more intuitive explanations, and quantitatively improves explanation fidelity over strong baselines on chemical graph property prediction and text graph sentiment classification.
    Incremental Unsupervised Feature Selection for Dynamic Incomplete Multi-view Data. (arXiv:2204.02973v2 [cs.LG] UPDATED)
    Multi-view unsupervised feature selection has been proven to be efficient in reducing the dimensionality of multi-view unlabeled data with high dimensions. The previous methods assume all of the views are complete. However, in real applications, the multi-view data are often incomplete, i.e., some views of instances are missing, which will result in the failure of these methods. Besides, while the data arrive in form of streams, these existing methods will suffer the issues of high storage cost and expensive computation time. To address these issues, we propose an Incremental Incomplete Multi-view Unsupervised Feature Selection method (I$^2$MUFS) on incomplete multi-view streaming data. By jointly considering the consistent and complementary information across different views, I$^2$MUFS embeds the unsupervised feature selection into an extended weighted non-negative matrix factorization model, which can learn a consensus clustering indicator matrix and fuse different latent feature matrices with adaptive view weights. Furthermore, we introduce the incremental leaning mechanisms to develop an alternative iterative algorithm, where the feature selection matrix is incrementally updated, rather than recomputing on the entire updated data from scratch. A series of experiments are conducted to verify the effectiveness of the proposed method by comparing with several state-of-the-art methods. The experimental results demonstrate the effectiveness and efficiency of the proposed method in terms of the clustering metrics and the computational cost.
    Decision-making for Autonomous Vehicles on Highway: Deep Reinforcement Learning with Continuous Action Horizon. (arXiv:2008.11852v2 [cs.AI] UPDATED)
    Decision-making strategy for autonomous vehicles de-scribes a sequence of driving maneuvers to achieve a certain navigational mission. This paper utilizes the deep reinforcement learning (DRL) method to address the continuous-horizon decision-making problem on the highway. First, the vehicle kinematics and driving scenario on the freeway are introduced. The running objective of the ego automated vehicle is to execute an efficient and smooth policy without collision. Then, the particular algorithm named proximal policy optimization (PPO)-enhanced DRL is illustrated. To overcome the challenges in tardy training efficiency and sample inefficiency, this applied algorithm could realize high learning efficiency and excellent control performance. Finally, the PPO-DRL-based decision-making strategy is estimated from multiple perspectives, including the optimality, learning efficiency, and adaptability. Its potential for online application is discussed by applying it to similar driving scenarios.
    Optimal Decision Making in High-Throughput Virtual Screening Pipelines. (arXiv:2109.11683v2 [math.OC] UPDATED)
    The need for efficient computational screening of molecular candidates that possess desired properties frequently arises in various scientific and engineering problems, including drug discovery and materials design. However, the large size of the search space containing the candidates and the substantial computational cost of high-fidelity property prediction models makes screening practically challenging. In this work, we propose a general framework for constructing and optimizing a virtual screening (HTVS) pipeline that consists of multi-fidelity models. The central idea is to optimally allocate the computational resources to models with varying costs and accuracy to optimize the return-on-computational-investment (ROCI). Based on both simulated as well as real data, we demonstrate that the proposed optimal HTVS framework can significantly accelerate screening virtually without any degradation in terms of accuracy. Furthermore, it enables an adaptive operational strategy for HTVS, where one can trade accuracy for efficiency.
    MindBigData 2022 A Large Dataset of Brain Signals. (arXiv:2212.14746v1 [eess.SP])
    Understanding our brain is one of the most daunting tasks, one we cannot expect to complete without the use of technology. MindBigData aims to provide a comprehensive and updated dataset of brain signals related to a diverse set of human activities so it can inspire the use of machine learning algorithms as a benchmark of 'decoding' performance from raw brain activities into its corresponding (labels) mental (or physical) tasks. Using commercial of the self, EEG devices or custom ones built by us to explore the limits of the technology. We describe the data collection procedures for each of the sub datasets and with every headset used to capture them. Also, we report possible applications in the field of Brain Computer Interfaces or BCI that could impact the life of billions, in almost every sector like healthcare game changing use cases, industry or entertainment to name a few, at the end why not directly using our brains to 'disintermediate' senses, as the final HCI (Human-Computer Interaction) device? simply what we call the journey from Type to Touch to Talk to Think.
    Long-Tailed Learning Requires Feature Learning. (arXiv:2205.14553v3 [cs.LG] UPDATED)
    We propose a simple data model inspired from natural data such as text or images, and use it to study the importance of learning features in order to achieve good generalization. Our data model follows a long-tailed distribution in the sense that some rare subcategories have few representatives in the training set. In this context we provide evidence that a learner succeeds if and only if it identifies the correct features, and moreover derive non-asymptotic generalization error bounds that precisely quantify the penalty that one must pay for not learning features.
    On the Robustness of Dialogue History Representation in Conversational Question Answering: A Comprehensive Study and a New Prompt-based Method. (arXiv:2206.14796v2 [cs.CL] UPDATED)
    Most works on modeling the conversation history in Conversational Question Answering (CQA) report a single main result on a common CQA benchmark. While existing models show impressive results on CQA leaderboards, it remains unclear whether they are robust to shifts in setting (sometimes to more realistic ones), training data size (e.g. from large to small sets) and domain. In this work, we design and conduct the first large-scale robustness study of history modeling approaches for CQA. We find that high benchmark scores do not necessarily translate to strong robustness, and that various methods can perform extremely differently under different settings. Equipped with the insights from our study, we design a novel prompt-based history modeling approach, and demonstrate its strong robustness across various settings. Our approach is inspired by existing methods that highlight historic answers in the passage. However, instead of highlighting by modifying the passage token embeddings, we add textual prompts directly in the passage text. Our approach is simple, easy-to-plug into practically any model, and highly effective, thus we recommend it as a starting point for future model developers. We also hope that our study and insights will raise awareness to the importance of robustness-focused evaluation, in addition to obtaining high leaderboard scores, leading to better CQA systems.
    Detecting Network-based Internet Censorship via Latent Feature Representation Learning. (arXiv:2209.05152v3 [cs.LG] UPDATED)
    Internet censorship is a phenomenon of societal importance and attracts investigation from multiple disciplines. Several research groups, such as Censored Planet, have deployed large scale Internet measurement platforms to collect network reachability data. However, existing studies generally rely on manually designed rules (i.e., using censorship fingerprints) to detect network-based Internet censorship from the data. While this rule-based approach yields a high true positive detection rate, it suffers from several challenges: it requires human expertise, is laborious, and cannot detect any censorship not captured by the rules. Seeking to overcome these challenges, we design and evaluate a classification model based on latent feature representation learning and an image-based classification model to detect network-based Internet censorship. To infer latent feature representations fromnetwork reachability data, we propose a sequence-to-sequence autoencoder to capture the structure and the order of data elements in the data. To estimate the probability of censorship events from the inferred latent features, we rely on a densely connected multi-layer neural network model. Our image-based classification model encodes a network reachability data record as a gray-scale image and classifies the image as censored or not using a dense convolutional neural network. We compare and evaluate both approaches using data sets from Censored Planet via a hold-out evaluation. Both classification models are capable of detecting network-based Internet censorship as we were able to identify instances of censorship not detected by the known fingerprints. Latent feature representations likely encode more nuances in the data since the latent feature learning approach discovers a greater quantity, and a more diverse set, of new censorship instances.
    Deep Hierarchy Quantization Compression algorithm based on Dynamic Sampling. (arXiv:2212.14760v1 [cs.LG])
    Unlike traditional distributed machine learning, federated learning stores data locally for training and then aggregates the models on the server, which solves the data security problem that may arise in traditional distributed machine learning. However, during the training process, the transmission of model parameters can impose a significant load on the network bandwidth. It has been pointed out that the vast majority of model parameters are redundant during model parameter transmission. In this paper, we explore the data distribution law of selected partial model parameters on this basis, and propose a deep hierarchical quantization compression algorithm, which further compresses the model and reduces the network load brought by data transmission through the hierarchical quantization of model parameters. And we adopt a dynamic sampling strategy for the selection of clients to accelerate the convergence of the model. Experimental results on different public datasets demonstrate the effectiveness of our algorithm.
    Carousel Memory: Rethinking the Design of Episodic Memory for Continual Learning. (arXiv:2110.07276v3 [cs.LG] UPDATED)
    Continual Learning (CL) is an emerging machine learning paradigm that aims to learn from a continuous stream of tasks without forgetting knowledge learned from the previous tasks. To avoid performance decrease caused by forgetting, prior studies exploit episodic memory (EM), which stores a subset of the past observed samples while learning from new non-i.i.d. data. Despite the promising results, since CL is often assumed to execute on mobile or IoT devices, the EM size is bounded by the small hardware memory capacity and makes it infeasible to meet the accuracy requirements for real-world applications. Specifically, all prior CL methods discard samples overflowed from the EM and can never retrieve them back for subsequent training steps, incurring loss of information that would exacerbate catastrophic forgetting. We explore a novel hierarchical EM management strategy to address the forgetting issue. In particular, in mobile and IoT devices, real-time data can be stored not just in high-speed RAMs but in internal storage devices as well, which offer significantly larger capacity than the RAMs. Based on this insight, we propose to exploit the abundant storage to preserve past experiences and alleviate the forgetting by allowing CL to efficiently migrate samples between memory and storage without being interfered by the slow access speed of the storage. We call it Carousel Memory (CarM). As CarM is complementary to existing CL methods, we conduct extensive evaluations of our method with seven popular CL methods and show that CarM significantly improves the accuracy of the methods across different settings by large margins in final average accuracy (up to 28.4%) while retaining the same training efficiency.
    Estimating Uncertainty in Neural Networks for Cardiac MRI Segmentation: A Benchmark Study. (arXiv:2012.15772v2 [eess.IV] UPDATED)
    Objective: Convolutional neural networks (CNNs) have demonstrated promise in automated cardiac magnetic resonance image segmentation. However, when using CNNs in a large real-world dataset, it is important to quantify segmentation uncertainty and identify segmentations which could be problematic. In this work, we performed a systematic study of Bayesian and non-Bayesian methods for estimating uncertainty in segmentation neural networks. Methods: We evaluated Bayes by Backprop, Monte Carlo Dropout, Deep Ensembles, and Stochastic Segmentation Networks in terms of segmentation accuracy, probability calibration, uncertainty on out-of-distribution images, and segmentation quality control. Results: We observed that Deep Ensembles outperformed the other methods except for images with heavy noise and blurring distortions. We showed that Bayes by Backprop is more robust to noise distortions while Stochastic Segmentation Networks are more resistant to blurring distortions. For segmentation quality control, we showed that segmentation uncertainty is correlated with segmentation accuracy for all the methods. With the incorporation of uncertainty estimates, we were able to reduce the percentage of poor segmentation to 5% by flagging 31--48% of the most uncertain segmentations for manual review, substantially lower than random review without using neural network uncertainty (reviewing 75--78% of all images). Conclusion: This work provides a comprehensive evaluation of uncertainty estimation methods and showed that Deep Ensembles outperformed other methods in most cases. Significance: Neural network uncertainty measures can help identify potentially inaccurate segmentations and alert users for manual review.
    Comparative Analysis of Clustering Techniques for Personalized Food Kit Distribution. (arXiv:2212.14874v1 [cs.LG])
    The Government of Kerala had increased the frequency of supply of free food kits owing to the pandemic, however, these items were static and not indicative of the personal preferences of the consumers. This paper conducts a comparative analysis of various clustering techniques on a scaled-down version of a real-world dataset obtained through a conjoint analysis-based survey. Clustering carried out by centroid-based methods such as k means is analyzed and the results are plotted along with SVD, and finally, a conclusion is reached as to which among the two is better. Once the clusters have been formulated, commodities are also decided upon for each cluster. Also, clustering is further enhanced by reassignment, based on a specific cluster loss threshold. Thus, the most efficacious clustering technique for designing a food kit tailored to the needs of individuals is finally obtained.
    DeLag: Using Multi-Objective Optimization to Enhance the Detection of Latency Degradation Patterns in Service-based Systems. (arXiv:2110.11155v3 [cs.SE] UPDATED)
    Performance debugging in production is a fundamental activity in modern service-based systems. The diagnosis of performance issues is often time-consuming, since it requires thorough inspection of large volumes of traces and performance indices. In this paper we present DeLag, a novel automated search-based approach for diagnosing performance issues in service-based systems. DeLag identifies subsets of requests that show, in the combination of their Remote Procedure Call execution times, symptoms of potentially relevant performance issues. We call such symptoms Latency Degradation Patterns. DeLag simultaneously searches for multiple latency degradation patterns while optimizing precision, recall and latency dissimilarity. Experimentation on 700 datasets of requests generated from two microservice-based systems shows that our approach provides better and more stable effectiveness than three state-of-the-art approaches and general purpose machine learning clustering algorithms. DeLag is more effective than all baseline techniques in at least one case study (with p $\leq$ 0.05 and non-negligible effect size). Moreover, DeLag outperforms in terms of efficiency the second and the third most effective baseline techniques on the largest datasets used in our evaluation (up to 22%).
    An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models. (arXiv:2212.14852v1 [cs.LG])
    With the attention mechanism, transformers achieve significant empirical successes. Despite the intuitive understanding that transformers perform relational inference over long sequences to produce desirable representations, we lack a rigorous theory on how the attention mechanism achieves it. In particular, several intriguing questions remain open: (a) What makes a desirable representation? (b) How does the attention mechanism infer the desirable representation within the forward pass? (c) How does a pretraining procedure learn to infer the desirable representation through the backward pass? We observe that, as is the case in BERT and ViT, input tokens are often exchangeable since they already include positional encodings. The notion of exchangeability induces a latent variable model that is invariant to input sizes, which enables our theoretical analysis. - To answer (a) on representation, we establish the existence of a sufficient and minimal representation of input tokens. In particular, such a representation instantiates the posterior distribution of the latent variable given input tokens, which plays a central role in predicting output labels and solving downstream tasks. - To answer (b) on inference, we prove that attention with the desired parameter infers the latent posterior up to an approximation error, which is decreasing in input sizes. In detail, we quantify how attention approximates the conditional mean of the value given the key, which characterizes how it performs relational inference over long sequences. - To answer (c) on learning, we prove that both supervised and self-supervised objectives allow empirical risk minimization to learn the desired parameter up to a generalization error, which is independent of input sizes. Particularly, in the self-supervised setting, we identify a condition number that is pivotal to solving downstream tasks.
    On Machine Learning Knowledge Representation In The Form Of Partially Unitary Operator. Knowledge Generalizing Operator. (arXiv:2212.14810v1 [cs.LG])
    A new form of ML knowledge representation with high generalization power is developed and implemented numerically. Initial $\mathit{IN}$ attributes and $\mathit{OUT}$ class label are transformed into the corresponding Hilbert spaces by considering localized wavefunctions. A partially unitary operator optimally converting a state from $\mathit{IN}$ Hilbert space into $\mathit{OUT}$ Hilbert space is then built from an optimization problem of transferring maximal possible probability from $\mathit{IN}$ to $\mathit{OUT}$, this leads to the formulation of a new algebraic problem. Constructed Knowledge Generalizing Operator $\mathcal{U}$ can be considered as a $\mathit{IN}$ to $\mathit{OUT}$ quantum channel; it is a partially unitary rectangular matrix of the dimension $\mathrm{dim}(\mathit{OUT}) \times \mathrm{dim}(\mathit{IN})$ transforming operators as $A^{\mathit{OUT}}=\mathcal{U} A^{\mathit{IN}} \mathcal{U}^{\dagger}$. Whereas only operator $\mathcal{U}$ projections squared are observable $\left\langle\mathit{OUT}|\mathcal{U}|\mathit{IN}\right\rangle^2$ (probabilities), the fundamental equation is formulated for the operator $\mathcal{U}$ itself. This is the reason of high generalizing power of the approach; the situation is the same as for the Schr\"{o}dinger equation: we can only measure $\psi^2$, but the equation is written for $\psi$ itself.
    A deep real options policy for sequential service region design and timing. (arXiv:2212.14800v1 [cs.LG])
    As various city agencies and mobility operators navigate toward innovative mobility solutions, there is a need for strategic flexibility in well-timed investment decisions in the design and timing of mobility service regions, i.e. cast as "real options" (RO). This problem becomes increasingly challenging with multiple interacting RO in such investments. We propose a scalable machine learning based RO framework for multi-period sequential service region design & timing problem for mobility-on-demand services, framed as a Markov decision process with non-stationary stochastic variables. A value function approximation policy from literature uses multi-option least squares Monte Carlo simulation to get a policy value for a set of interdependent investment decisions as deferral options (CR policy). The goal is to determine the optimal selection and timing of a set of zones to include in a service region. However, prior work required explicit enumeration of all possible sequences of investments. To address the combinatorial complexity of such enumeration, we propose a new variant "deep" RO policy using an efficient recurrent neural network (RNN) based ML method (CR-RNN policy) to sample sequences to forego the need for enumeration, making network design & timing policy tractable for large scale implementation. Experiments on multiple service region scenarios in New York City (NYC) shows the proposed policy substantially reduces the overall computational cost (time reduction for RO evaluation of > 90% of total investment sequences is achieved), with zero to near-zero gap compared to the benchmark. A case study of sequential service region design for expansion of MoD services in Brooklyn, NYC show that using the CR-RNN policy to determine optimal RO investment strategy yields a similar performance (0.5% within CR policy value) with significantly reduced computation time (about 5.4 times faster).
    Online Statistical Inference for Contextual Bandits via Stochastic Gradient Descent. (arXiv:2212.14883v1 [stat.ML])
    With the fast development of big data, it has been easier than before to learn the optimal decision rule by updating the decision rule recursively and making online decisions. We study the online statistical inference of model parameters in a contextual bandit framework of sequential decision-making. We propose a general framework for online and adaptive data collection environment that can update decision rules via weighted stochastic gradient descent. We allow different weighting schemes of the stochastic gradient and establish the asymptotic normality of the parameter estimator. Our proposed estimator significantly improves the asymptotic efficiency over the previous averaged SGD approach via inverse probability weights. We also conduct an optimality analysis on the weights in a linear regression setting. We provide a Bahadur representation of the proposed estimator and show that the remainder term in the Bahadur representation entails a slower convergence rate compared to classical SGD due to the adaptive data collection.
    The Feasibility and Inevitability of Stealth Attacks. (arXiv:2106.13997v3 [cs.CR] UPDATED)
    We develop and study new adversarial perturbations that enable an attacker to gain control over decisions in generic Artificial Intelligence (AI) systems including deep learning neural networks. In contrast to adversarial data modification, the attack mechanism we consider here involves alterations to the AI system itself. Such a stealth attack could be conducted by a mischievous, corrupt or disgruntled member of a software development team. It could also be made by those wishing to exploit a ``democratization of AI'' agenda, where network architectures and trained parameter sets are shared publicly. We develop a range of new implementable attack strategies with accompanying analysis, showing that with high probability a stealth attack can be made transparent, in the sense that system performance is unchanged on a fixed validation set which is unknown to the attacker, while evoking any desired output on a trigger input of interest. The attacker only needs to have estimates of the size of the validation set and the spread of the AI's relevant latent space. In the case of deep learning neural networks, we show that a one neuron attack is possible - a modification to the weights and bias associated with a single neuron - revealing a vulnerability arising from over-parameterization. We illustrate these concepts using state of the art architectures on two standard image data sets. Guided by the theory and computational results, we also propose strategies to guard against stealth attacks.
    A Unified Framework for Online Trip Destination Prediction. (arXiv:2101.04520v2 [cs.LG] UPDATED)
    Trip destination prediction is an area of increasing importance in many applications such as trip planning, autonomous driving and electric vehicles. Even though this problem could be naturally addressed in an online learning paradigm where data is arriving in a sequential fashion, the majority of research has rather considered the offline setting. In this paper, we present a unified framework for trip destination prediction in an online setting, which is suitable for both online training and online prediction. For this purpose, we develop two clustering algorithms and integrate them within two online prediction models for this problem. We investigate the different configurations of clustering algorithms and prediction models on a real-world dataset. We demonstrate that both the clustering and the entire framework yield consistent results compared to the offline setting. Finally, we propose a novel regret metric for evaluating the entire online framework in comparison to its offline counterpart. This metric makes it possible to relate the source of erroneous predictions to either the clustering or the prediction model. Using this metric, we show that the proposed methods converge to a probability distribution resembling the true underlying distribution with a lower regret than all of the baselines.
    Understanding and Accelerating Neural Architecture Search with Training-Free and Theory-Grounded Metrics. (arXiv:2108.11939v2 [cs.LG] UPDATED)
    This work targets designing a principled and unified training-free framework for Neural Architecture Search (NAS), with high performance, low cost, and in-depth interpretation. NAS has been explosively studied to automate the discovery of top-performer neural networks, but suffers from heavy resource consumption and often incurs search bias due to truncated training or approximations. Recent NAS works start to explore indicators that can predict a network's performance without training. However, they either leveraged limited properties of deep networks, or the benefits of their training-free indicators are not applied to more extensive search methods. By rigorous correlation analysis, we present a unified framework to understand and accelerate NAS, by disentangling "TEG" characteristics of searched networks - Trainability, Expressivity, Generalization - all assessed in a training-free manner. The TEG indicators could be scaled up and integrated with various NAS search methods, including both supernet and single-path approaches. Extensive studies validate the effective and efficient guidance from our TEG-NAS framework, leading to both improved search accuracy and over 56% reduction in search time cost. Moreover, we visualize search trajectories on three landscapes of "TEG" characteristics, observing that while a good local minimum is easier to find on NAS-Bench-201 given its simple topology, balancing "TEG" characteristics is much harder on the DARTS search space due to its complex landscape geometry. Our code is available at https://github.com/VITA-Group/TEGNAS.
    The Improvement of Decision Tree Construction Algorithm Based On Quantum Heuristic Algorithms. (arXiv:2212.14725v1 [quant-ph])
    This work is related to the implementation of a decision tree construction algorithm on a quantum simulator. Here we consider an algorithm based on a binary criterion. Also, we study the improvement capability with quantum heuristic QAOA. We implemented the classical and the quantum version of this algorithm to compare built trees.
    Personalized Student Attribute Inference. (arXiv:2212.14682v1 [cs.CY])
    Accurately predicting their future performance can ensure students successful graduation, and help them save both time and money. However, achieving such predictions faces two challenges, mainly due to the diversity of students' background and the necessity of continuously tracking their evolving progress. The goal of this work is to create a system able to automatically detect students in difficulty, for instance predicting if they are likely to fail a course. We compare a naive approach widely used in the literature, which uses attributes available in the data set (like the grades), with a personalized approach we called Personalized Student Attribute Inference (PSAI). With our model, we create personalized attributes to capture the specific background of each student. Both approaches are compared using machine learning algorithms like decision trees, support vector machine or neural networks.
    Improving Convergence for Quantum Variational Classifiers using Weight Re-Mapping. (arXiv:2212.14807v1 [quant-ph])
    In recent years, quantum machine learning has seen a substantial increase in the use of variational quantum circuits (VQCs). VQCs are inspired by artificial neural networks, which achieve extraordinary performance in a wide range of AI tasks as massively parameterized function approximators. VQCs have already demonstrated promising results, for example, in generalization and the requirement for fewer parameters to train, by utilizing the more robust algorithmic toolbox available in quantum computing. A VQCs' trainable parameters or weights are usually used as angles in rotational gates and current gradient-based training methods do not account for that. We introduce weight re-mapping for VQCs, to unambiguously map the weights to an interval of length $2\pi$, drawing inspiration from traditional ML, where data rescaling, or normalization techniques have demonstrated tremendous benefits in many circumstances. We employ a set of five functions and evaluate them on the Iris and Wine datasets using variational classifiers as an example. Our experiments show that weight re-mapping can improve convergence in all tested settings. Additionally, we were able to demonstrate that weight re-mapping increased test accuracy for the Wine dataset by $10\%$ over using unmodified weights.
    Lab-scale Vibration Analysis Dataset and Baseline Methods for Machinery Fault Diagnosis with Machine Learning. (arXiv:2212.14732v1 [eess.SP])
    The monitoring of machine conditions in a plant is crucial for production in manufacturing. A sudden failure of a machine can stop production and cause a loss of revenue. The vibration signal of a machine is a good indicator of its condition. This paper presents a dataset of vibration signals from a lab-scale machine. The dataset contains four different types of machine conditions: normal, unbalance, misalignment, and bearing fault. Three machine learning methods (SVM, KNN, and GNB) evaluated the dataset, and a perfect result was obtained by one of the methods on a 1-fold test. The performance of the algorithms is evaluated using weighted accuracy (WA) since the data is balanced. The results show that the best-performing algorithm is the SVM with a WA of 99.75\% on the 5-fold cross-validations. The dataset is provided in the form of CSV files in an open and free repository at https://zenodo.org/record/7006575.
    On Biased Compression for Distributed Learning. (arXiv:2002.12410v3 [cs.LG] UPDATED)
    In the last few years, various communication compression techniques have emerged as an indispensable tool helping to alleviate the communication bottleneck in distributed learning. However, despite the fact {\em biased} compressors often show superior performance in practice when compared to the much more studied and understood {\em unbiased} compressors, very little is known about them. In this work we study three classes of biased compression operators, two of which are new, and their performance when applied to (stochastic) gradient descent and distributed (stochastic) gradient descent. We show for the first time that biased compressors can lead to linear convergence rates both in the single node and distributed settings. We prove that distributed compressed SGD method, employed with error feedback mechanism, enjoys the ergodic rate $\mathcal{O}\left( \delta L \exp[-\frac{\mu K}{\delta L}] + \frac{(C + \delta D)}{K\mu}\right)$, where $\delta\ge1$ is a compression parameter which grows when more compression is applied, $L$ and $\mu$ are the smoothness and strong convexity constants, $C$ captures stochastic gradient noise ($C=0$ if full gradients are computed on each node) and $D$ captures the variance of the gradients at the optimum ($D=0$ for over-parameterized models). Further, via a theoretical study of several synthetic and empirical distributions of communicated gradients, we shed light on why and by how much biased compressors outperform their unbiased variants. Finally, we propose several new biased compressors with promising theoretical guarantees and practical performance.
    Cross-Domain Shopping and Stock Trend Analysis. (arXiv:2212.14689v1 [q-fin.ST])
    This paper presents a cross-domain trend analysis that aims to identify and analyze the relationships between stock prices, stock news on Twitter, and users' behaviors on e-commerce websites. The analysis is based on three datasets: a US stock dataset, a stock tweets dataset, and an e-commerce behavior dataset. The analysis is performed using Hadoop, Hive, and Tableau, allowing for efficient and scalable processing and visualizing large datasets. The analysis includes trend analysis of Twitter sentiment (positive and negative tweets) and correlation analysis, including the correlation between tweet sentiment and stocks, the correlation between stock trends and shopping behavior, and the understanding of data based on different slices of time. By comparing different features from the datasets over time, we hope to gain insight into the factors that drive user behavior as well as the market in different categories. The results of this analysis can provide valuable insights for businesses and investors to inform decision-making. We believe that our analysis can serve as a valuable starting point for further research and investigation into these topics.
    Unsupervised learning for structure detection in plastically deformed crystals. (arXiv:2212.14813v1 [cond-mat.mtrl-sci])
    Detecting structures at the particle scale within plastically deformed crystalline materials allows a better understanding of the occurring phenomena. While previous approaches mostly relied on applying hand-chosen criteria on different local parameters, these approaches could only detect already known structures.We introduce an unsupervised learning algorithm to automatically detect structures within a crystal under plastic deformation. This approach is based on a study developed for structural detection on colloidal materials. This algorithm has the advantage of being computationally fast and easy to implement. We show that by using local parameters based on bond-angle distributions, we are able to detect more structures and with a higher degree of precision than traditional hand-made criteria.
    Unbox the Blackbox: Predict and Interpret YouTube Viewership Using Deep Learning. (arXiv:2101.01076v7 [cs.LG] UPDATED)
    Predicting video viewership is a top priority for content creators and video-sharing sites. Content creators live on such predictions to maximize influences and minimize budgets. Video-sharing sites rely on this prediction to promote credible videos and curb violative videos. Although deep learning champions viewership prediction, it lacks interpretability, which is fundamental to increasing the adoption of predictive models and prescribing measurements to improve viewership. Following the design-science paradigm, we propose a novel interpretable IT system, Precise Wide and Deep Learning (PrecWD), to precisely interpret viewership prediction. Improving upon state-of-the-art frameworks, PrecWD offers precise feature effects and designs an unstructured component. PrecWD outperforms benchmarks in two contexts: health video viewership prediction and misinformation viewership prediction. A user study confirms the superior interpretability of PrecWD. This study contributes to IS design theory with generalizable design principles and an interpretable predictive framework. Our findings provide implications to improve video viewership and credibility.
    A Comparison Study of Deep CNN Architecture in Detecting of Pneumonia. (arXiv:2212.14744v1 [eess.IV])
    Pneumonia, a respiratory infection brought on by bacteria or viruses, affects a large number of people, especially in developing and impoverished countries where high levels of pollution, unclean living conditions, and overcrowding are frequently observed, along with insufficient medical infrastructure. Pleural effusion, a condition in which fluids fill the lung and complicate breathing, is brought on by pneumonia. Early detection of pneumonia is essential for ensuring curative care and boosting survival rates. The approach most usually used to diagnose pneumonia is chest X-ray imaging. The purpose of this work is to develop a method for the automatic diagnosis of bacterial and viral pneumonia in digital x-ray pictures. This article first presents the authors' technique, and then gives a comprehensive report on recent developments in the field of reliable diagnosis of pneumonia. In this study, here tuned a state-of-the-art deep convolutional neural network to classify plant diseases based on images and tested its performance. Deep learning architecture is compared empirically. VGG19, ResNet with 152v2, Resnext101, Seresnet152, Mobilenettv2, and DenseNet with 201 layers are among the architectures tested. Experiment data consists of two groups, sick and healthy X-ray pictures. To take appropriate action against plant diseases as soon as possible, rapid disease identification models are preferred. DenseNet201 has shown no overfitting or performance degradation in our experiments, and its accuracy tends to increase as the number of epochs increases. Further, DenseNet201 achieves state-of-the-art performance with a significantly a smaller number of parameters and within a reasonable computing time. This architecture outperforms the competition in terms of testing accuracy, scoring 95%. Each architecture was trained using Keras, using Theano as the backend.
    Industrial Scene Change Detection using Deep Convolutional Neural Networks. (arXiv:2212.14278v1 [cs.CV])
    Finding and localizing the conceptual changes in two scenes in terms of the presence or removal of objects in two images belonging to the same scene at different times in special care applications is of great significance. This is mainly due to the fact that addition or removal of important objects for some environments can be harmful. As a result, there is a need to design a program that locates these differences using machine vision. The most important challenge of this problem is the change in lighting conditions and the presence of shadows in the scene. Therefore, the proposed methods must be resistant to these challenges. In this article, a method based on deep convolutional neural networks using transfer learning is introduced, which is trained with an intelligent data synthesis process. The results of this method are tested and presented on the dataset provided for this purpose. It is shown that the presented method is more efficient than other methods and can be used in a variety of real industrial environments.
    Domain-specific transfer learning in the automated scoring of tumor-stroma ratio from histopathological images of colorectal cancer. (arXiv:2212.14652v1 [cs.CV])
    Tumor-stroma ratio (TSR) is a prognostic factor for many types of solid tumors. In this study, we propose a method for automated estimation of TSR from histopathological images of colorectal cancer. The method is based on convolutional neural networks which were trained to classify colorectal cancer tissue in hematoxylin-eosin stained samples into three classes: stroma, tumor and other. The models were trained using a data set that consists of 1343 whole slide images. Three different training setups were applied with a transfer learning approach using domain-specific data i.e. an external colorectal cancer histopathological data set. The three most accurate models were chosen as a classifier, TSR values were predicted and the results were compared to a visual TSR estimation made by a pathologist. The results suggest that classification accuracy does not improve when domain-specific data are used in the pre-training of the convolutional neural network models in the task at hand. Classification accuracy for stroma, tumor and other reached 96.1$\%$ on an independent test set. Among the three classes the best model gained the highest accuracy (99.3$\%$) for class tumor. When TSR was predicted with the best model, the correlation between the predicted values and values estimated by an experienced pathologist was 0.57. Further research is needed to study associations between computationally predicted TSR values and other clinicopathological factors of colorectal cancer and the overall survival of the patients.
    Extended method for Statistical Signal Characterization using moments and cumulants: Application to recognition of pattern alterations in pulse-like waveforms employing Artificial Neural Networks. (arXiv:2212.14783v1 [eess.SP])
    We propose a statistical procedure to characterize and extract features from a waveform that can be applied as a pre-processing signal stage in a pattern recognition task using Artificial Neural Networks. Such a procedure is based on measuring a 30-parameters set of moments and cumulants from the waveform, its derivative, and its integral. The technique is presented as an extension of the Statistical Signal Characterization method existing in the literature. As a testing methodology, we used the procedure to distinguish a pulse-like signal from different versions of itself with frequency spectrum alterations or deformations. The recognition task was performed by single feed-forward back-propagation networks trained for the case Sinc-, Gaussian-, and Chirp-pulse waveform. Because of the success obtained in these examples, we can conclude that the proposed extended statistical signal characterization method is an effective tool for pattern-recognition applications. In particular, we can use it as a fast pre-processing stage in embedded systems with limited memory or computational capability.
    A Learned Simulation Environment to Model Student Engagement and Retention in Automated Online Courses. (arXiv:2212.14693v1 [cs.CY])
    We developed a simulator to quantify the effect of exercise ordering on both student engagement and retention. Our approach combines the construction of neural network representations for users and exercises using a dynamic matrix factorization method. We further created a machine learning models of success and dropout prediction. As a result, our system is able to predict student engagement and retention based on a given sequence of exercises selected. This opens the door to the development of versatile reinforcement learning agents which can substitute the role of private tutoring in exam preparation.
    Non-intrusive surrogate modelling using sparse random features with applications in crashworthiness analysis. (arXiv:2212.14507v1 [cs.LG])
    Efficient surrogate modelling is a key requirement for uncertainty quantification in data-driven scenarios. In this work, a novel approach of using Sparse Random Features for surrogate modelling in combination with self-supervised dimensionality reduction is described. The method is compared to other methods on synthetic and real data obtained from crashworthiness analyses. The results show a superiority of the here described approach over state of the art surrogate modelling techniques, Polynomial Chaos Expansions and Neural Networks.
    MAUVE Scores for Generative Models: Theory and Practice. (arXiv:2212.14578v1 [cs.LG])
    Generative AI has matured to a point where large-scale models can generate text that seems indistinguishable from human-written text and remarkably photorealistic images. Automatically measuring how close the distribution of generated data is to the target real data distribution is a key step in diagnosing existing models and developing better models. We present MAUVE, a family of comparison measures between pairs of distributions such as those encountered in the generative modeling of text or images. These scores are statistical summaries of divergence frontiers capturing two types of errors in generative modeling. We explore four approaches to statistically estimate these scores: vector quantization, non-parametric estimation, classifier-based estimation, and parametric Gaussian approximations. We provide statistical bounds for the vector quantization approach. Empirically, we find that the proposed scores paired with a range of $f$-divergences and statistical estimation methods can quantify the gaps between the distributions of human-written text and those of modern neural language models by correlating with human judgments and identifying known properties of the generated texts. We conclude the paper by demonstrating its applications to other AI domains and discussing practical recommendations.
    Nonparametric Regression with Shallow Overparameterized Neural Networks Trained by GD with Early Stopping. (arXiv:2107.05341v3 [cs.LG] UPDATED)
    We explore the ability of overparameterized shallow neural networks to learn Lipschitz regression functions with and without label noise when trained by Gradient Descent (GD). To avoid the problem that in the presence of noisy labels, neural networks trained to nearly zero training error are inconsistent on this class, we propose an early stopping rule that allows us to show optimal rates. This provides an alternative to the result of Hu et al. (2021) who studied the performance of $\ell 2$ -regularized GD for training shallow networks in nonparametric regression which fully relied on the infinite-width network (Neural Tangent Kernel (NTK)) approximation. Here we present a simpler analysis which is based on a partitioning argument of the input space (as in the case of 1-nearest-neighbor rule) coupled with the fact that trained neural networks are smooth with respect to their inputs when trained by GD. In the noise-free case the proof does not rely on any kernelization and can be regarded as a finite-width result. In the case of label noise, by slightly modifying the proof, the noise is controlled using a technique of Yao, Rosasco, and Caponnetto (2007).
    Hamiltonian Deep Neural Networks Guaranteeing Non-vanishing Gradients by Design. (arXiv:2105.13205v2 [cs.LG] UPDATED)
    Deep Neural Networks (DNNs) training can be difficult due to vanishing and exploding gradients during weight optimization through backpropagation. To address this problem, we propose a general class of Hamiltonian DNNs (H-DNNs) that stem from the discretization of continuous-time Hamiltonian systems and include several existing DNN architectures based on ordinary differential equations. Our main result is that a broad set of H-DNNs ensures non-vanishing gradients by design for an arbitrary network depth. This is obtained by proving that, using a semi-implicit Euler discretization scheme, the backward sensitivity matrices involved in gradient computations are symplectic. We also provide an upper-bound to the magnitude of sensitivity matrices and show that exploding gradients can be controlled through regularization. Finally, we enable distributed implementations of backward and forward propagation algorithms in H-DNNs by characterizing appropriate sparsity constraints on the weight matrices. The good performance of H-DNNs is demonstrated on benchmark classification problems, including image classification with the MNIST dataset.
    Pain level and pain-related behaviour classification using GRU-based sparsely-connected RNNs. (arXiv:2212.14806v1 [eess.SP])
    There is a growing body of studies on applying deep learning to biometrics analysis. Certain circumstances, however, could impair the objective measures and accuracy of the proposed biometric data analysis methods. For instance, people with chronic pain (CP) unconsciously adapt specific body movements to protect themselves from injury or additional pain. Because there is no dedicated benchmark database to analyse this correlation, we considered one of the specific circumstances that potentially influence a person's biometrics during daily activities in this study and classified pain level and pain-related behaviour in the EmoPain database. To achieve this, we proposed a sparsely-connected recurrent neural networks (s-RNNs) ensemble with the gated recurrent unit (GRU) that incorporates multiple autoencoders using a shared training framework. This architecture is fed by multidimensional data collected from inertial measurement unit (IMU) and surface electromyography (sEMG) sensors. Furthermore, to compensate for variations in the temporal dimension that may not be perfectly represented in the latent space of s-RNNs, we fused hand-crafted features derived from information-theoretic approaches with represented features in the shared hidden state. We conducted several experiments which indicate that the proposed method outperforms the state-of-the-art approaches in classifying both pain level and pain-related behaviour.
    Machine Learning and Thermography Applied to the Detection and Classification of Cracks in Building. (arXiv:2212.14730v1 [cs.CV])
    Due to the environmental impacts caused by the construction industry, repurposing existing buildings and making them more energy-efficient has become a high-priority issue. However, a legitimate concern of land developers is associated with the buildings' state of conservation. For that reason, infrared thermography has been used as a powerful tool to characterize these buildings' state of conservation by detecting pathologies, such as cracks and humidity. Thermal cameras detect the radiation emitted by any material and translate it into temperature-color-coded images. Abnormal temperature changes may indicate the presence of pathologies, however, reading thermal images might not be quite simple. This research project aims to combine infrared thermography and machine learning (ML) to help stakeholders determine the viability of reusing existing buildings by identifying their pathologies and defects more efficiently and accurately. In this particular phase of this research project, we've used an image classification machine learning model of Convolutional Neural Networks (DCNN) to differentiate three levels of cracks in one particular building. The model's accuracy was compared between the MSX and thermal images acquired from two distinct thermal cameras and fused images (formed through multisource information) to test the influence of the input data and network on the detection results.
    Improving Certified Robustness via Statistical Learning with Logical Reasoning. (arXiv:2003.00120v7 [cs.LG] UPDATED)
    Intensive algorithmic efforts have been made to enable the rapid improvements of certificated robustness for complex ML models recently. However, current robustness certification methods are only able to certify under a limited perturbation radius. Given that existing pure data-driven statistical approaches have reached a bottleneck, in this paper, we propose to integrate statistical ML models with knowledge (expressed as logical rules) as a reasoning component using Markov logic networks (MLN, so as to further improve the overall certified robustness. This opens new research questions about certifying the robustness of such a paradigm, especially the reasoning component (e.g., MLN). As the first step towards understanding these questions, we first prove that the computational complexity of certifying the robustness of MLN is #P-hard. Guided by this hardness result, we then derive the first certified robustness bound for MLN by carefully analyzing different model regimes. Finally, we conduct extensive experiments on five datasets including both high-dimensional images and natural language texts, and we show that the certified robustness with knowledge-based logical reasoning indeed significantly outperforms that of the state-of-the-arts.
    PAC-Bayesian-Like Error Bound for a Class of Linear Time-Invariant Stochastic State-Space Models. (arXiv:2212.14838v1 [stat.ML])
    In this paper we derive a PAC-Bayesian-Like error bound for a class of stochastic dynamical systems with inputs, namely, for linear time-invariant stochastic state-space models (stochastic LTI systems for short). This class of systems is widely used in control engineering and econometrics, in particular, they represent a special case of recurrent neural networks. In this paper we 1) formalize the learning problem for stochastic LTI systems with inputs, 2) derive a PAC-Bayesian-Like error bound for such systems, 3) discuss various consequences of this error bound.
    On the Interpretability of Attention Networks. (arXiv:2212.14776v1 [cs.LG])
    Attention mechanisms form a core component of several successful deep learning architectures, and are based on one key idea: ''The output depends only on a small (but unknown) segment of the input.'' In several practical applications like image captioning and language translation, this is mostly true. In trained models with an attention mechanism, the outputs of an intermediate module that encodes the segment of input responsible for the output is often used as a way to peek into the `reasoning` of the network. We make such a notion more precise for a variant of the classification problem that we term selective dependence classification (SDC) when used with attention model architectures. Under such a setting, we demonstrate various error modes where an attention model can be accurate but fail to be interpretable, and show that such models do occur as a result of training. We illustrate various situations that can accentuate and mitigate this behaviour. Finally, we use our objective definition of interpretability for SDC tasks to evaluate a few attention model learning algorithms designed to encourage sparsity and demonstrate that these algorithms help improve interpretability.
    ChatGPT Makes Medicine Easy to Swallow: An Exploratory Case Study on Simplified Radiology Reports. (arXiv:2212.14882v1 [cs.CL])
    The release of ChatGPT, a language model capable of generating text that appears human-like and authentic, has gained significant attention beyond the research community. We expect that the convincing performance of ChatGPT incentivizes users to apply it to a variety of downstream tasks, including prompting the model to simplify their own medical reports. To investigate this phenomenon, we conducted an exploratory case study. In a questionnaire, we asked 15 radiologists to assess the quality of radiology reports simplified by ChatGPT. Most radiologists agreed that the simplified reports were factually correct, complete, and not potentially harmful to the patient. Nevertheless, instances of incorrect statements, missed key medical findings, and potentially harmful passages were reported. While further studies are needed, the initial insights of this study indicate a great potential in using large language models like ChatGPT to improve patient-centered care in radiology and other medical domains.
    A Learning-Based Optimal Uncertainty Quantification Method and Its Application to Ballistic Impact Problems. (arXiv:2212.14709v1 [cs.LG])
    This paper concerns the study of optimal (supremum and infimum) uncertainty bounds for systems where the input (or prior) probability measure is only partially/imperfectly known (e.g., with only statistical moments and/or on a coarse topology) rather than fully specified. Such partial knowledge provides constraints on the input probability measures. The theory of Optimal Uncertainty Quantification allows us to convert the task into a constraint optimization problem where one seeks to compute the least upper/greatest lower bound of the system's output uncertainties by finding the extremal probability measure of the input. Such optimization requires repeated evaluation of the system's performance indicator (input to performance map) and is high-dimensional and non-convex by nature. Therefore, it is difficult to find the optimal uncertainty bounds in practice. In this paper, we examine the use of machine learning, especially deep neural networks, to address the challenge. We achieve this by introducing a neural network classifier to approximate the performance indicator combined with the stochastic gradient descent method to solve the optimization problem. We demonstrate the learning based framework on the uncertainty quantification of the impact of magnesium alloys, which are promising light-weight structural and protective materials. Finally, we show that the approach can be used to construct maps for the performance certificate and safety design in engineering practice.
    Boosting Simple Learners. (arXiv:2001.11704v7 [cs.LG] UPDATED)
    Boosting is a celebrated machine learning approach which is based on the idea of combining weak and moderately inaccurate hypotheses to a strong and accurate one. We study boosting under the assumption that the weak hypotheses belong to a class of bounded capacity. This assumption is inspired by the common convention that weak hypotheses are "rules-of-thumbs" from an "easy-to-learn class". (Schapire and Freund~'12, Shalev-Shwartz and Ben-David '14.) Formally, we assume the class of weak hypotheses has a bounded VC dimension. We focus on two main questions: (i) Oracle Complexity: How many weak hypotheses are needed to produce an accurate hypothesis? We design a novel boosting algorithm and demonstrate that it circumvents a classical lower bound by Freund and Schapire ('95, '12). Whereas the lower bound shows that $\Omega({1}/{\gamma^2})$ weak hypotheses with $\gamma$-margin are sometimes necessary, our new method requires only $\tilde{O}({1}/{\gamma})$ weak hypothesis, provided that they belong to a class of bounded VC dimension. Unlike previous boosting algorithms which aggregate the weak hypotheses by majority votes, the new boosting algorithm uses more complex ("deeper") aggregation rules. We complement this result by showing that complex aggregation rules are in fact necessary to circumvent the aforementioned lower bound. (ii) Expressivity: Which tasks can be learned by boosting weak hypotheses from a bounded VC class? Can complex concepts that are "far away" from the class be learned? Towards answering the first question we {introduce combinatorial-geometric parameters which capture expressivity in boosting.} As a corollary we provide an affirmative answer to the second question for well-studied classes, including half-spaces and decision stumps. Along the way, we establish and exploit connections with Discrepancy Theory.
    ComplAI: Theory of A Unified Framework for Multi-factor Assessment of Black-Box Supervised Machine Learning Models. (arXiv:2212.14599v1 [cs.LG])
    The advances in Artificial Intelligence are creating new opportunities to improve lives of people around the world, from business to healthcare, from lifestyle to education. For example, some systems profile the users using their demographic and behavioral characteristics to make certain domain-specific predictions. Often, such predictions impact the life of the user directly or indirectly (e.g., loan disbursement, determining insurance coverage, shortlisting applications, etc.). As a result, the concerns over such AI-enabled systems are also increasing. To address these concerns, such systems are mandated to be responsible i.e., transparent, fair, and explainable to developers and end-users. In this paper, we present ComplAI, a unique framework to enable, observe, analyze and quantify explainability, robustness, performance, fairness, and model behavior in drift scenarios, and to provide a single Trust Factor that evaluates different supervised Machine Learning models not just from their ability to make correct predictions but from overall responsibility perspective. The framework helps users to (a) connect their models and enable explanations, (b) assess and visualize different aspects of the model, such as robustness, drift susceptibility, and fairness, and (c) compare different models (from different model families or obtained through different hyperparameter settings) from an overall perspective thereby facilitating actionable recourse for improvement of the models. It is model agnostic and works with different supervised machine learning scenarios (i.e., Binary Classification, Multi-class Classification, and Regression) and frameworks. It can be seamlessly integrated with any ML life-cycle framework. Thus, this already deployed framework aims to unify critical aspects of Responsible AI systems for regulating the development process of such real systems.
    Macro-block dropout for improved regularization in training end-to-end speech recognition models. (arXiv:2212.14149v1 [cs.LG])
    This paper proposes a new regularization algorithm referred to as macro-block dropout. The overfitting issue has been a difficult problem in training large neural network models. The dropout technique has proven to be simple yet very effective for regularization by preventing complex co-adaptations during training. In our work, we define a macro-block that contains a large number of units from the input to a Recurrent Neural Network (RNN). Rather than applying dropout to each unit, we apply random dropout to each macro-block. This algorithm has the effect of applying different drop out rates for each layer even if we keep a constant average dropout rate, which has better regularization effects. In our experiments using Recurrent Neural Network-Transducer (RNN-T), this algorithm shows relatively 4.30 % and 6.13 % Word Error Rates (WERs) improvement over the conventional dropout on LibriSpeech test-clean and test-other. With an Attention-based Encoder-Decoder (AED) model, this algorithm shows relatively 4.36 % and 5.85 % WERs improvement over the conventional dropout on the same test sets.
    Visual CPG-RL: Learning Central Pattern Generators for Visually-Guided Quadruped Navigation. (arXiv:2212.14400v1 [cs.RO])
    In this paper, we present a framework for learning quadruped navigation by integrating central pattern generators (CPGs), i.e. systems of coupled oscillators, into the deep reinforcement learning (DRL) framework. Through both exteroceptive and proprioceptive sensing, the agent learns to modulate the intrinsic oscillator setpoints (amplitude and frequency) and coordinate rhythmic behavior among different oscillators to track velocity commands while avoiding collisions with the environment. We compare different neural network architectures (i.e. memory-free and memory-enabled) which learn implicit interoscillator couplings, as well as varying the strength of the explicit coupling weights in the oscillator dynamics equations. We train our policies in simulation and perform a sim-to-real transfer to the Unitree Go1 quadruped, where we observe robust navigation in a variety of scenarios. Our results show that both memory-enabled policy representations and explicit interoscillator couplings are beneficial for a successful sim-to-real transfer for navigation tasks. Video results can be found at https://youtu.be/O_LX1oLZOe0.
    Node-Element Hypergraph Message Passing for Fluid Dynamics Simulations. (arXiv:2212.14545v1 [physics.flu-dyn])
    A recent trend in deep learning research features the application of graph neural networks for mesh-based continuum mechanics simulations. Most of these frameworks operate on graphs in which each edge connects two nodes. Inspired by the data connectivity in the finite element method, we connect the nodes by elements rather than edges, effectively forming a hypergraph. We implement a message-passing network on such a node-element hypergraph and explore the capability of the network for the modeling of fluid flow. The network is tested on two common benchmark problems, namely the fluid flow around a circular cylinder and airfoil configurations. The results show that such a message-passing network defined on the node-element hypergraph is able to generate more stable and accurate temporal roll-out predictions compared to the baseline generalized message-passing network defined on a normal graph. Along with adjustments in activation function and training loss, we expect this work to set a new strong baseline for future explorations of mesh-based fluid simulations with graph neural networks.
    Machine Learning as an Accurate Predictor for Percolation Threshold of Diverse Networks. (arXiv:2212.14694v1 [physics.soc-ph])
    Percolation threshold is an important measure to determine the inherent rigidity of large networks. Predictors of the percolation threshold for large networks are computationally intense to run, hence it is a necessity to develop predictors of the percolation threshold of networks, that do not rely on numerical simulations. We demonstrate the efficacy of five machine learning-based regression techniques for the accurate prediction of the percolation threshold. The dataset generated to train the machine learning models contains a total of 777 real and synthetic networks and consists of 5 statistical and structural properties of networks as features and the numerically computed percolation threshold as the output attribute. We establish that the machine learning models outperform three existing empirical estimators of bond percolation threshold, and extend this experiment to predict site and explosive percolation. We also compare the performance of our models in predicting the percolation threshold and find that the gradient boosting regressor, multilayer perceptron and random forests regression models achieve the least RMSE values among the models utilized.
    Cluster-level Group Representativity Fairness in $k$-means Clustering. (arXiv:2212.14467v1 [cs.LG])
    There has been much interest recently in developing fair clustering algorithms that seek to do justice to the representation of groups defined along sensitive attributes such as race and gender. We observe that clustering algorithms could generate clusters such that different groups are disadvantaged within different clusters. We develop a clustering algorithm, building upon the centroid clustering paradigm pioneered by classical algorithms such as $k$-means, where we focus on mitigating the unfairness experienced by the most-disadvantaged group within each cluster. Our method uses an iterative optimisation paradigm whereby an initial cluster assignment is modified by reassigning objects to clusters such that the worst-off sensitive group within each cluster is benefitted. We demonstrate the effectiveness of our method through extensive empirical evaluations over a novel evaluation metric on real-world datasets. Specifically, we show that our method is effective in enhancing cluster-level group representativity fairness significantly at low impact on cluster coherence.
    Fruit Ripeness Classification: a Survey. (arXiv:2212.14441v1 [cs.CV])
    Fruit is a key crop in worldwide agriculture feeding millions of people. The standard supply chain of fruit products involves quality checks to guarantee freshness, taste, and, most of all, safety. An important factor that determines fruit quality is its stage of ripening. This is usually manually classified by experts in the field, which makes it a labor-intensive and error-prone process. Thus, there is an arising need for automation in the process of fruit ripeness classification. Many automatic methods have been proposed that employ a variety of feature descriptors for the food item to be graded. Machine learning and deep learning techniques dominate the top-performing methods. Furthermore, deep learning can operate on raw data and thus relieve the users from having to compute complex engineered features, which are often crop-specific. In this survey, we review the latest methods proposed in the literature to automatize fruit ripeness classification, highlighting the most common feature descriptors they operate on.
    Pontryagin Optimal Controller via Neural Networks. (arXiv:2212.14566v1 [eess.SY])
    Solving real-world optimal control problems are challenging tasks, as the system dynamics can be highly non-linear or including nonconvex objectives and constraints, while in some cases the dynamics are unknown, making it hard to numerically solve the optimal control actions. To deal with such modeling and computation challenges, in this paper, we integrate Neural Networks with the Pontryagin's Minimum Principle (PMP), and propose a computationally efficient framework NN-PMP. The resulting controller can be implemented for systems with unknown and complex dynamics. It can not only utilize the accurate surrogate models parameterized by neural networks, but also efficiently recover the optimality conditions along with the optimal action sequences via PMP conditions. A toy example on a nonlinear Martian Base operation along with a real-world lossy energy storage arbitrage example demonstrates our proposed NN-PMP is a general and versatile computation tool for finding optimal solutions. Compared with solutions provided by the numerical optimization solver with approximated linear dynamics, NN-PMP achieves more efficient system modeling and higher performance in terms of control objectives.
    Can $5^{\rm th}$ Generation Local Training Methods Support Client Sampling? Yes!. (arXiv:2212.14370v1 [cs.LG])
    The celebrated FedAvg algorithm of McMahan et al. (2017) is based on three components: client sampling (CS), data sampling (DS) and local training (LT). While the first two are reasonably well understood, the third component, whose role is to reduce the number of communication rounds needed to train the model, resisted all attempts at a satisfactory theoretical explanation. Malinovsky et al. (2022) identified four distinct generations of LT methods based on the quality of the provided theoretical communication complexity guarantees. Despite a lot of progress in this area, none of the existing works were able to show that it is theoretically better to employ multiple local gradient-type steps (i.e., to engage in LT) than to rely on a single local gradient-type step only in the important heterogeneous data regime. In a recent breakthrough embodied in their ProxSkip method and its theoretical analysis, Mishchenko et al. (2022) showed that LT indeed leads to provable communication acceleration for arbitrarily heterogeneous data, thus jump-starting the $5^{\rm th}$ generation of LT methods. However, while these latest generation LT methods are compatible with DS, none of them support CS. We resolve this open problem in the affirmative. In order to do so, we had to base our algorithmic development on new algorithmic and theoretical foundations.
    INO: Invariant Neural Operators for Learning Complex Physical Systems with Momentum Conservation. (arXiv:2212.14365v1 [cs.LG])
    Neural operators, which emerge as implicit solution operators of hidden governing equations, have recently become popular tools for learning responses of complex real-world physical systems. Nevertheless, the majority of neural operator applications has thus far been data-driven, which neglects the intrinsic preservation of fundamental physical laws in data. In this paper, we introduce a novel integral neural operator architecture, to learn physical models with fundamental conservation laws automatically guaranteed. In particular, by replacing the frame-dependent position information with its invariant counterpart in the kernel space, the proposed neural operator is by design translation- and rotation-invariant, and consequently abides by the conservation laws of linear and angular momentums. As applications, we demonstrate the expressivity and efficacy of our model in learning complex material behaviors from both synthetic and experimental datasets, and show that, by automatically satisfying these essential physical laws, our learned neural operator is not only generalizable in handling translated and rotated datasets, but also achieves state-of-the-art accuracy and efficiency as compared to baseline neural operator models.
    Long-horizon video prediction using a dynamic latent hierarchy. (arXiv:2212.14376v1 [cs.LG])
    The task of video prediction and generation is known to be notoriously difficult, with the research in this area largely limited to short-term predictions. Though plagued with noise and stochasticity, videos consist of features that are organised in a spatiotemporal hierarchy, different features possessing different temporal dynamics. In this paper, we introduce Dynamic Latent Hierarchy (DLH) -- a deep hierarchical latent model that represents videos as a hierarchy of latent states that evolve over separate and fluid timescales. Each latent state is a mixture distribution with two components, representing the immediate past and the predicted future, causing the model to learn transitions only between sufficiently dissimilar states, while clustering temporally persistent states closer together. Using this unique property, DLH naturally discovers the spatiotemporal structure of a dataset and learns disentangled representations across its hierarchy. We hypothesise that this simplifies the task of modeling temporal dynamics of a video, improves the learning of long-term dependencies, and reduces error accumulation. As evidence, we demonstrate that DLH outperforms state-of-the-art benchmarks in video prediction, is able to better represent stochasticity, as well as to dynamically adjust its hierarchical and temporal structure. Our paper shows, among other things, how progress in representation learning can translate into progress in prediction tasks.
    Learned Hierarchical B-frame Coding with Adaptive Feature Modulation for YUV 4:2:0 Content. (arXiv:2212.14187v1 [cs.CV])
    This paper introduces a learned hierarchical B-frame coding scheme in response to the Grand Challenge on Neural Network-based Video Coding at ISCAS 2023. We address specifically three issues, including (1) B-frame coding, (2) YUV 4:2:0 coding, and (3) content-adaptive variable-rate coding with only one single model. Most learned video codecs operate internally in the RGB domain for P-frame coding. B-frame coding for YUV 4:2:0 content is largely under-explored. In addition, while there have been prior works on variable-rate coding with conditional convolution, most of them fail to consider the content information. We build our scheme on conditional augmented normalized flows (CANF). It features conditional motion and inter-frame codecs for efficient B-frame coding. To cope with YUV 4:2:0 content, two conditional inter-frame codecs are used to process the Y and UV components separately, with the coding of the UV components conditioned additionally on the Y component. Moreover, we introduce adaptive feature modulation in every convolutional layer, taking into account both the content information and the coding levels of B-frames to achieve content-adaptive variable-rate coding. Experimental results show that our model outperforms x265 and the winner of last year's challenge on commonly used datasets in terms of PSNR-YUV.
    Bayesian statistical learning using density operators. (arXiv:2212.14715v1 [math.ST])
    This short study reformulates the statistical Bayesian learning problem using a quantum mechanics framework. Density operators representing ensembles of pure states of sample wave functions are used in place probability densities. We show that such representation allows to formulate the statistical Bayesian learning problem in different coordinate systems on the sample space. We further show that such representation allows to learn projections of density operators using a kernel trick. In particular, the study highlights that decomposing wave functions rather than probability densities, as it is done in kernel embedding, allows to preserve the nature of probability operators. Results are illustrated with a simple example using discrete orthogonal wavelet transform of density operators.
    Relative Probability on Finite Outcome Spaces: A Systematic Examination of its Axiomatization, Properties, and Applications. (arXiv:2212.14555v1 [stat.ML])
    This work proposes a view of probability as a relative measure rather than an absolute one. To demonstrate this concept, we focus on finite outcome spaces and develop three fundamental axioms that establish requirements for relative probability functions. We then provide a library of examples of these functions and a system for composing them. Additionally, we discuss a relative version of Bayesian inference and its digital implementation. Finally, we prove the topological closure of the relative probability space, highlighting its ability to preserve information under limits.
    Constant Approximation for Normalized Modularity and Associations Clustering. (arXiv:2212.14334v1 [cs.DS])
    We study the problem of graph clustering under a broad class of objectives in which the quality of a cluster is defined based on the ratio between the number of edges in the cluster, and the total weight of vertices in the cluster. We show that our definition is closely related to popular clustering measures, namely normalized associations, which is a dual of the normalized cut objective, and normalized modularity. We give a linear time constant-approximate algorithm for our objective, which implies the first constant-factor approximation algorithms for normalized modularity and normalized associations.
    Delving into Semantic Scale Imbalance. (arXiv:2212.14613v1 [cs.CV])
    Model bias triggered by long-tailed data has been widely studied. However, measure based on the number of samples cannot explicate three phenomena simultaneously: (1) Given enough data, the classification performance gain is marginal with additional samples. (2) Classification performance decays precipitously as the number of training samples decreases when there is insufficient data. (3) Model trained on sample-balanced datasets still has different biases for different classes. In this work, we define and quantify the semantic scale of classes, which is used to measure the feature diversity of classes. It is exciting to find experimentally that there is a marginal effect of semantic scale, which perfectly describes the first two phenomena. Further, the quantitative measurement of semantic scale imbalance is proposed, which can accurately reflect model bias on multiple datasets, even on sample-balanced data, revealing a novel perspective for the study of class imbalance. Due to the prevalence of semantic scale imbalance, we propose semantic-scale-balanced learning, including a general loss improvement scheme and a dynamic re-weighting training framework that overcomes the challenge of calculating semantic scales in real-time during iterations. Comprehensive experiments show that dynamic semantic-scale-balanced learning consistently enables the model to perform superiorly on large-scale long-tailed and non-long-tailed natural and medical datasets, which is a good starting point for mitigating the prevalent but unnoticed model bias.
    Policy Mirror Ascent for Efficient and Independent Learning in Mean Field Games. (arXiv:2212.14449v1 [math.OC])
    Mean-field games have been used as a theoretical tool to obtain an approximate Nash equilibrium for symmetric and anonymous $N$-player games in literature. However, limiting applicability, existing theoretical results assume variations of a "population generative model", which allows arbitrary modifications of the population distribution by the learning algorithm. Instead, we show that $N$ agents running policy mirror ascent converge to the Nash equilibrium of the regularized game within $\tilde{\mathcal{O}}(\varepsilon^{-2})$ samples from a single sample trajectory without a population generative model, up to a standard $\mathcal{O}(\frac{1}{\sqrt{N}})$ error due to the mean field. Taking a divergent approach from literature, instead of working with the best-response map we first show that a policy mirror ascent map can be used to construct a contractive operator having the Nash equilibrium as its fixed point. Next, we prove that conditional TD-learning in $N$-agent games can learn value functions within $\tilde{\mathcal{O}}(\varepsilon^{-2})$ time steps. These results allow proving sample complexity guarantees in the oracle-free setting by only relying on a sample path from the $N$ agent simulator. Furthermore, we demonstrate that our methodology allows for independent learning by $N$ agents with finite sample guarantees.
    POMRL: No-Regret Learning-to-Plan with Increasing Horizons. (arXiv:2212.14530v1 [cs.AI])
    We study the problem of planning under model uncertainty in an online meta-reinforcement learning (RL) setting where an agent is presented with a sequence of related tasks with limited interactions per task. The agent can use its experience in each task and across tasks to estimate both the transition model and the distribution over tasks. We propose an algorithm to meta-learn the underlying structure across tasks, utilize it to plan in each task, and upper-bound the regret of the planning loss. Our bound suggests that the average regret over tasks decreases as the number of tasks increases and as the tasks are more similar. In the classical single-task setting, it is known that the planning horizon should depend on the estimated model's accuracy, that is, on the number of samples within task. We generalize this finding to meta-RL and study this dependence of planning horizons on the number of tasks. Based on our theoretical findings, we derive heuristics for selecting slowly increasing discount factors, and we validate its significance empirically.
    An Entropy-Based Model for Hierarchical Learning. (arXiv:2212.14681v1 [stat.ML])
    Machine learning is the dominant approach to artificial intelligence, through which computers learn from data and experience. In the framework of supervised learning, for a computer to learn from data accurately and efficiently, some auxiliary information about the data distribution and target function should be provided to it through the learning model. This notion of auxiliary information relates to the concept of regularization in statistical learning theory. A common feature among real-world datasets is that data domains are multiscale and target functions are well-behaved and smooth. In this paper, we propose a learning model that exploits this multiscale data structure and discuss its statistical and computational benefits. The hierarchical learning model is inspired by the logical and progressive easy-to-hard learning mechanism of human beings and has interpretable levels. The model apportions computational resources according to the complexity of data instances and target functions. This property can have multiple benefits, including higher inference speed and computational savings in training a model for many users or when training is interrupted. We provide a statistical analysis of the learning mechanism using multiscale entropies and show that it can yield significantly stronger guarantees than uniform convergence bounds.
    Eliminating Meta Optimization Through Self-Referential Meta Learning. (arXiv:2212.14392v1 [cs.LG])
    Meta Learning automates the search for learning algorithms. At the same time, it creates a dependency on human engineering on the meta-level, where meta learning algorithms need to be designed. In this paper, we investigate self-referential meta learning systems that modify themselves without the need for explicit meta optimization. We discuss the relationship of such systems to in-context and memory-based meta learning and show that self-referential neural networks require functionality to be reused in the form of parameter sharing. Finally, we propose fitness monotonic execution (FME), a simple approach to avoid explicit meta optimization. A neural network self-modifies to solve bandit and classic control tasks, improves its self-modifications, and learns how to learn, purely by assigning more computational resources to better performing solutions.
    A Data-Adaptive Prior for Bayesian Learning of Kernels in Operators. (arXiv:2212.14163v1 [stat.ML])
    Kernels are efficient in representing nonlocal dependence and they are widely used to design operators between function spaces. Thus, learning kernels in operators from data is an inverse problem of general interest. Due to the nonlocal dependence, the inverse problem can be severely ill-posed with a data-dependent singular inversion operator. The Bayesian approach overcomes the ill-posedness through a non-degenerate prior. However, a fixed non-degenerate prior leads to a divergent posterior mean when the observation noise becomes small, if the data induces a perturbation in the eigenspace of zero eigenvalues of the inversion operator. We introduce a data-adaptive prior to achieve a stable posterior whose mean always has a small noise limit. The data-adaptive prior's covariance is the inversion operator with a hyper-parameter selected adaptive to data by the L-curve method. Furthermore, we provide a detailed analysis on the computational practice of the data-adaptive prior, and demonstrate it on Toeplitz matrices and integral operators. Numerical tests show that a fixed prior can lead to a divergent posterior mean in the presence of any of the four types of errors: discretization error, model error, partial observation and wrong noise assumption. In contrast, the data-adaptive prior always attains posterior means with small noise limits.
    A novel cluster internal evaluation index based on hyper-balls. (arXiv:2212.14524v1 [cs.LG])
    It is crucial to evaluate the quality and determine the optimal number of clusters in cluster analysis. In this paper, the multi-granularity characterization of the data set is carried out to obtain the hyper-balls. The cluster internal evaluation index based on hyper-balls(HCVI) is defined. Moreover, a general method for determining the optimal number of clusters based on HCVI is proposed. The proposed methods can evaluate the clustering results produced by the several classic methods and determine the optimal cluster number for data sets containing noises and clusters with arbitrary shapes. The experimental results on synthetic and real data sets indicate that the new index outperforms existing ones.
    Hierarchical Deep Reinforcement Learning for VWAP Strategy Optimization. (arXiv:2212.14670v1 [q-fin.TR])
    Designing an intelligent volume-weighted average price (VWAP) strategy is a critical concern for brokers, since traditional rule-based strategies are relatively static that cannot achieve a lower transaction cost in a dynamic market. Many studies have tried to minimize the cost via reinforcement learning, but there are bottlenecks in improvement, especially for long-duration strategies such as the VWAP strategy. To address this issue, we propose a deep learning and hierarchical reinforcement learning jointed architecture termed Macro-Meta-Micro Trader (M3T) to capture market patterns and execute orders from different temporal scales. The Macro Trader first allocates a parent order into tranches based on volume profiles as the traditional VWAP strategy does, but a long short-term memory neural network is used to improve the forecasting accuracy. Then the Meta Trader selects a short-term subgoal appropriate to instant liquidity within each tranche to form a mini-tranche. The Micro Trader consequently extracts the instant market state and fulfils the subgoal with the lowest transaction cost. Our experiments over stocks listed on the Shanghai stock exchange demonstrate that our approach outperforms baselines in terms of VWAP slippage, with an average cost saving of 1.16 base points compared to the optimal baseline.
    Improving a sequence-to-sequence nlp model using a reinforcement learning policy algorithm. (arXiv:2212.14117v1 [cs.CL])
    Nowadays, the current neural network models of dialogue generation(chatbots) show great promise for generating answers for chatty agents. But they are short-sighted in that they predict utterances one at a time while disregarding their impact on future outcomes. Modelling a dialogue's future direction is critical for generating coherent, interesting dialogues, a need that has led traditional NLP dialogue models that rely on reinforcement learning. In this article, we explain how to combine these objectives by using deep reinforcement learning to predict future rewards in chatbot dialogue. The model simulates conversations between two virtual agents, with policy gradient methods used to reward sequences that exhibit three useful conversational characteristics: the flow of informality, coherence, and simplicity of response (related to forward-looking function). We assess our model based on its diversity, length, and complexity with regard to humans. In dialogue simulation, evaluations demonstrated that the proposed model generates more interactive responses and encourages a more sustained successful conversation. This work commemorates a preliminary step toward developing a neural conversational model based on the long-term success of dialogues.
    An Experience-based Direct Generation approach to Automatic Image Cropping. (arXiv:2212.14561v1 [cs.CV])
    Automatic Image Cropping is a challenging task with many practical downstream applications. The task is often divided into sub-problems - generating cropping candidates, finding the visually important regions, and determining aesthetics to select the most appealing candidate. Prior approaches model one or more of these sub-problems separately, and often combine them sequentially. We propose a novel convolutional neural network (CNN) based method to crop images directly, without explicitly modeling image aesthetics, evaluating multiple crop candidates, or detecting visually salient regions. Our model is trained on a large dataset of images cropped by experienced editors and can simultaneously predict bounding boxes for multiple fixed aspect ratios. We consider the aspect ratio of the cropped image to be a critical factor that influences aesthetics. Prior approaches for automatic image cropping, did not enforce the aspect ratio of the outputs, likely due to a lack of datasets for this task. We, therefore, benchmark our method on public datasets for two related tasks - first, aesthetic image cropping without regard to aspect ratio, and second, thumbnail generation that requires fixed aspect ratio outputs, but where aesthetics are not crucial. We show that our strategy is competitive with or performs better than existing methods in both these tasks. Furthermore, our one-stage model is easier to train and significantly faster than existing two-stage or end-to-end methods for inference. We present a qualitative evaluation study, and find that our model is able to generalize to diverse images from unseen datasets and often retains compositional properties of the original images after cropping. Our results demonstrate that explicitly modeling image aesthetics or visual attention regions is not necessarily required to build a competitive image cropping algorithm.
    On Learning the Structure of Clusters in Graphs. (arXiv:2212.14345v1 [cs.DS])
    Graph clustering is a fundamental problem in unsupervised learning, with numerous applications in computer science and in analysing real-world data. In many real-world applications, we find that the clusters have a significant high-level structure. This is often overlooked in the design and analysis of graph clustering algorithms which make strong simplifying assumptions about the structure of the graph. This thesis addresses the natural question of whether the structure of clusters can be learned efficiently and describes four new algorithmic results for learning such structure in graphs and hypergraphs. All of the presented theoretical results are extensively evaluated on both synthetic and real-word datasets of different domains, including image classification and segmentation, migration networks, co-authorship networks, and natural language processing. These experimental results demonstrate that the newly developed algorithms are practical, effective, and immediately applicable for learning the structure of clusters in real-world data.
    Pensieve 5G: Implementation of RL-based ABR Algorithm for UHD 4K/8K Content Delivery on Commercial 5G SA/NR-DC Network. (arXiv:2212.14479v1 [cs.NI])
    While the rollout of the fifth-generation mobile network (5G) is underway across the globe with the intention to deliver 4K/8K UHD videos, Augmented Reality (AR), and Virtual Reality (VR) content to the mass amounts of users, the coverage and throughput are still one of the most significant issues, especially in the rural areas, where only 5G in the low-frequency band are being deployed. This called for a high-performance adaptive bitrate (ABR) algorithm that can maximize the user quality of experience given 5G network characteristics and data rate of UHD contents. Recently, many of the newly proposed ABR techniques were machine-learning based. Among that, Pensieve is one of the state-of-the-art techniques, which utilized reinforcement-learning to generate an ABR algorithm based on observation of past decision performance. By incorporating the context of the 5G network and UHD content, Pensieve has been optimized into Pensieve 5G. New QoE metrics that more accurately represent the QoE of UHD video streaming on the different types of devices were proposed and used to evaluate Pensieve 5G against other ABR techniques including the original Pensieve. The results from the simulation based on the real 5G Standalone (SA) network throughput shows that Pensieve 5G outperforms both conventional algorithms and Pensieve with the average QoE improvement of 8.8% and 14.2%, respectively. Additionally, Pensieve 5G also performed well on the commercial 5G NR-NR Dual Connectivity (NR-DC) Network, despite the training being done solely using the data from the 5G Standalone (SA) network.
    Estimating Latent Population Flows from Aggregated Data via Inversing Multi-Marginal Optimal Transport. (arXiv:2212.14527v1 [cs.LG])
    We study the problem of estimating latent population flows from aggregated count data. This problem arises when individual trajectories are not available due to privacy issues or measurement fidelity. Instead, the aggregated observations are measured over discrete-time points, for estimating the population flows among states. Most related studies tackle the problems by learning the transition parameters of a time-homogeneous Markov process. Nonetheless, most real-world population flows can be influenced by various uncertainties such as traffic jam and weather conditions. Thus, in many cases, a time-homogeneous Markov model is a poor approximation of the much more complex population flows. To circumvent this difficulty, we resort to a multi-marginal optimal transport (MOT) formulation that can naturally represent aggregated observations with constrained marginals, and encode time-dependent transition matrices by the cost functions. In particular, we propose to estimate the transition flows from aggregated data by learning the cost functions of the MOT framework, which enables us to capture time-varying dynamic patterns. The experiments demonstrate the improved accuracy of the proposed algorithms than the related methods in estimating several real-world transition flows.
    Customizing Knowledge Graph Embedding to Improve Clinical Study Recommendation. (arXiv:2212.14102v1 [cs.LG])
    Inferring knowledge from clinical trials using knowledge graph embedding is an emerging area. However, customizing graph embeddings for different use cases remains a significant challenge. We propose custom2vec, an algorithmic framework to customize graph embeddings by incorporating user preferences in training the embeddings. It captures user preferences by adding custom nodes and links derived from manually vetted results of a separate information retrieval method. We propose a joint learning objective to preserve the original network structure while incorporating the user's custom annotations. We hypothesize that the custom training improves user-expected predictions, for example, in link prediction tasks. We demonstrate the effectiveness of custom2vec for clinical trials related to non-small cell lung cancer (NSCLC) with two customization scenarios: recommending immuno-oncology trials evaluating PD-1 inhibitors and exploring similar trials that compare new therapies with a standard of care. The results show that custom2vec training achieves better performance than the conventional training methods. Our approach is a novel way to customize knowledge graph embeddings and enable more accurate recommendations and predictions.
    Label-Efficient Interactive Time-Series Anomaly Detection. (arXiv:2212.14621v1 [cs.LG])
    Time-series anomaly detection is an important task and has been widely applied in the industry. Since manual data annotation is expensive and inefficient, most applications adopt unsupervised anomaly detection methods, but the results are usually sub-optimal and unsatisfactory to end customers. Weak supervision is a promising paradigm for obtaining considerable labels in a low-cost way, which enables the customers to label data by writing heuristic rules rather than annotating each instance individually. However, in the time-series domain, it is hard for people to write reasonable labeling functions as the time-series data is numerically continuous and difficult to be understood. In this paper, we propose a Label-Efficient Interactive Time-Series Anomaly Detection (LEIAD) system, which enables a user to improve the results of unsupervised anomaly detection by performing only a small amount of interactions with the system. To achieve this goal, the system integrates weak supervision and active learning collaboratively while generating labeling functions automatically using only a few labeled data. All of these techniques are complementary and can promote each other in a reinforced manner. We conduct experiments on three time-series anomaly detection datasets, demonstrating that the proposed system is superior to existing solutions in both weak supervision and active learning areas. Also, the system has been tested in a real scenario in industry to show its practicality.
    Learning One Abstract Bit at a Time Through Self-Invented Experiments Encoded as Neural Networks. (arXiv:2212.14374v1 [cs.LG])
    There are two important things in science: (A) Finding answers to given questions, and (B) Coming up with good questions. Our artificial scientists not only learn to answer given questions, but also continually invent new questions, by proposing hypotheses to be verified or falsified through potentially complex and time-consuming experiments, including thought experiments akin to those of mathematicians. While an artificial scientist expands its knowledge, it remains biased towards the simplest, least costly experiments that still have surprising outcomes, until they become boring. We present an empirical analysis of the automatic generation of interesting experiments. In the first setting, we investigate self-invented experiments in a reinforcement-providing environment and show that they lead to effective exploration. In the second setting, pure thought experiments are implemented as the weights of recurrent neural networks generated by a neural experiment generator. Initially interesting thought experiments may become boring over time.
    Learning Multimodal Data Augmentation in Feature Space. (arXiv:2212.14453v1 [cs.LG])
    The ability to jointly learn from multiple modalities, such as text, audio, and visual data, is a defining feature of intelligent systems. While there have been promising advances in designing neural networks to harness multimodal data, the enormous success of data augmentation currently remains limited to single-modality tasks like image classification. Indeed, it is particularly difficult to augment each modality while preserving the overall semantic structure of the data; for example, a caption may no longer be a good description of an image after standard augmentations have been applied, such as translation. Moreover, it is challenging to specify reasonable transformations that are not tailored to a particular modality. In this paper, we introduce LeMDA, Learning Multimodal Data Augmentation, an easy-to-use method that automatically learns to jointly augment multimodal data in feature space, with no constraints on the identities of the modalities or the relationship between modalities. We show that LeMDA can (1) profoundly improve the performance of multimodal deep learning architectures, (2) apply to combinations of modalities that have not been previously considered, and (3) achieve state-of-the-art results on a wide range of applications comprised of image, text, and tabular data.
    A Dynamics Theory of Implicit Regularization in Deep Low-Rank Matrix Factorization. (arXiv:2212.14150v1 [cs.LG])
    Implicit regularization is an important way to interpret neural networks. Recent theory starts to explain implicit regularization with the model of deep matrix factorization (DMF) and analyze the trajectory of discrete gradient dynamics in the optimization process. These discrete gradient dynamics are relatively small but not infinitesimal, thus fitting well with the practical implementation of neural networks. Currently, discrete gradient dynamics analysis has been successfully applied to shallow networks but encounters the difficulty of complex computation for deep networks. In this work, we introduce another discrete gradient dynamics approach to explain implicit regularization, i.e. landscape analysis. It mainly focuses on gradient regions, such as saddle points and local minima. We theoretically establish the connection between saddle point escaping (SPE) stages and the matrix rank in DMF. We prove that, for a rank-R matrix reconstruction, DMF will converge to a second-order critical point after R stages of SPE. This conclusion is further experimentally verified on a low-rank matrix reconstruction problem. This work provides a new theory to analyze implicit regularization in deep learning.
    Restricting to the chip architecture maintains the quantum neural network accuracy, if the parameterization is a $2$-design. (arXiv:2212.14426v1 [quant-ph])
    In the era of noisy intermediate scale quantum devices, variational quantum circuits (VQCs) are currently one of the main strategies for building quantum machine learning models. These models are made up of a quantum part and a classical part. The quantum part is given by a parametrization $U$, which, in general, is obtained from the product of different quantum gates. By its turn, the classical part corresponds to an optimizer that updates the parameters of $U$ in order to minimize a cost function $C$. However, despite the many applications of VQCs, there are still questions to be answered, such as for example: What is the best sequence of gates to be used? How to optimize their parameters? Which cost function to use? How the architecture of the quantum chips influences the final results? In this article, we focus on answering the last question. We will show that, in general, the cost function will tend to a typical average value the closer the parameterization used is from a $2$-design. Therefore, the closer this parameterization is to a $2$-design, the less the result of the quantum neural network model will depend on its parametrization. As a consequence, we can use the own architecture of the quantum chips to defined the VQC parametrization, avoiding the use of additional swap gates and thus diminishing the VQC depth and the associated errors.
    A Multiagent Framework for the Asynchronous and Collaborative Extension of Multitask ML Systems. (arXiv:2209.14745v2 [cs.LG] UPDATED)
    The traditional ML development methodology does not enable a large number of contributors, each with distinct objectives, to work collectively on the creation and extension of a shared intelligent system. Enabling such a collaborative methodology can accelerate the rate of innovation, increase ML technologies accessibility and enable the emergence of novel capabilities. We believe that this novel methodology for ML development can be demonstrated through a modularized representation of ML models and the definition of novel abstractions allowing to implement and execute diverse methods for the asynchronous use and extension of modular intelligent systems. We present a multiagent framework for the collaborative and asynchronous extension of dynamic large-scale multitask systems.
    Detection of out-of-distribution samples using binary neuron activation patterns. (arXiv:2212.14268v1 [cs.LG])
    Deep neural networks (DNN) have outstanding performance in various applications. Despite numerous efforts of the research community, out-of-distribution (OOD) samples remain significant limitation of DNN classifiers. The ability to identify previously unseen inputs as novel is crucial in safety-critical applications such as self-driving cars, unmanned aerial vehicles and robots. Existing approaches to detect OOD samples treat a DNN as a black box and assess the confidence score of the output predictions. Unfortunately, this method frequently fails, because DNN are not trained to reduce their confidence for OOD inputs. In this work, we introduce a novel method for OOD detection. Our method is motivated by theoretical analysis of neuron activation patterns (NAP) in ReLU based architectures. The proposed method does not introduce high computational workload due to the binary representation of the activation patterns extracted from convolutional layers. The extensive empirical evaluation proves its high performance on various DNN architectures and seven image datasets. ion.
    Lookback for Learning to Branch. (arXiv:2206.14987v2 [cs.LG] UPDATED)
    The expressive and computationally inexpensive bipartite Graph Neural Networks (GNN) have been shown to be an important component of deep learning based Mixed-Integer Linear Program (MILP) solvers. Recent works have demonstrated the effectiveness of such GNNs in replacing the branching (variable selection) heuristic in branch-and-bound (B&B) solvers. These GNNs are trained, offline and on a collection of MILPs, to imitate a very good but computationally expensive branching heuristic, strong branching. Given that B&B results in a tree of sub-MILPs, we ask (a) whether there are strong dependencies exhibited by the target heuristic among the neighboring nodes of the B&B tree, and (b) if so, whether we can incorporate them in our training procedure. Specifically, we find that with the strong branching heuristic, a child node's best choice was often the parent's second-best choice. We call this the "lookback" phenomenon. Surprisingly, the typical branching GNN of Gasse et al. (2019) often misses this simple "answer". To imitate the target behavior more closely by incorporating the lookback phenomenon in GNNs, we propose two methods: (a) target smoothing for the standard cross-entropy loss function, and (b) adding a Parent-as-Target (PAT) Lookback regularizer term. Finally, we propose a model selection framework to incorporate harder-to-formulate objectives such as solving time in the final models. Through extensive experimentation on standard benchmark instances, we show that our proposal results in up to 22% decrease in the size of the B&B tree and up to 15% improvement in the solving times.
    CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks. (arXiv:2201.05729v3 [cs.CV] UPDATED)
    Contrastive language-image pretraining (CLIP) links vision and language modalities into a unified embedding space, yielding the tremendous potential for vision-language (VL) tasks. While early concurrent works have begun to study this potential on a subset of tasks, important questions remain: 1) What is the benefit of CLIP on unstudied VL tasks? 2) Does CLIP provide benefit in low-shot or domain-shifted scenarios? 3) Can CLIP improve existing approaches without impacting inference or pretraining complexity? In this work, we seek to answer these questions through two key contributions. First, we introduce an evaluation protocol that includes Visual Commonsense Reasoning (VCR), Visual Entailment (SNLI-VE), and Visual Question Answering (VQA), across a variety of data availability constraints and conditions of domain shift. Second, we propose an approach, named CLIP Targeted Distillation (CLIP-TD), to intelligently distill knowledge from CLIP into existing architectures using a dynamically weighted objective applied to adaptively selected tokens per instance. Experiments demonstrate that our proposed CLIP-TD leads to exceptional gains in the low-shot (up to 51.9%) and domain-shifted (up to 71.3%) conditions of VCR, while simultaneously improving performance under standard fully-supervised conditions (up to 2%), achieving state-of-art performance on VCR compared to other single models that are pretrained with image-text data only. On SNLI-VE, CLIP-TD produces significant gains in low-shot conditions (up to 6.6%) as well as fully supervised (up to 3%). On VQA, CLIP-TD provides improvement in low-shot (up to 9%), and in fully-supervised (up to 1.3%). Finally, CLIP-TD outperforms concurrent works utilizing CLIP for finetuning, as well as baseline naive distillation approaches. Code will be made available.
    Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets. (arXiv:2210.05958v2 [cs.CV] UPDATED)
    There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets, which is concluded to the lack of inductive bias. In this paper, we further consider this problem and point out two weaknesses of ViTs in inductive biases, that is, the spatial relevance and diverse channel representation. First, on spatial aspect, objects are locally compact and relevant, thus fine-grained feature needs to be extracted from a token and its neighbors. While the lack of data hinders ViTs to attend the spatial relevance. Second, on channel aspect, representation exhibits diversity on different channels. But the scarce data can not enable ViTs to learn strong enough representation for accurate recognition. To this end, we propose Dynamic Hybrid Vision Transformer (DHVT) as the solution to enhance the two inductive biases. On spatial aspect, we adopt a hybrid structure, in which convolution is integrated into patch embedding and multi-layer perceptron module, forcing the model to capture the token features as well as their neighboring features. On channel aspect, we introduce a dynamic feature aggregation module in MLP and a brand new "head token" design in multi-head self-attention module to help re-calibrate channel representation and make different channel group representation interacts with each other. The fusion of weak channel representation forms a strong enough representation for classification. With this design, we successfully eliminate the performance gap between CNNs and ViTs, and our DHVT achieves a series of state-of-the-art performance with a lightweight model, 85.68% on CIFAR-100 with 22.8M parameters, 82.3% on ImageNet-1K with 24.0M parameters. Code is available at https://github.com/ArieSeirack/DHVT.
    Robust Ranking Explanations. (arXiv:2212.14106v1 [cs.LG])
    Gradient-based explanation is the cornerstone of explainable deep networks, but it has been shown to be vulnerable to adversarial attacks. However, existing works measure the explanation robustness based on $\ell_p$-norm, which can be counter-intuitive to humans, who only pay attention to the top few salient features. We propose explanation ranking thickness as a more suitable explanation robustness metric. We then present a new practical adversarial attacking goal for manipulating explanation rankings. To mitigate the ranking-based attacks while maintaining computational feasibility, we derive surrogate bounds of the thickness that involve expensive sampling and integration. We use a multi-objective approach to analyze the convergence of a gradient-based attack to confirm that the explanation robustness can be measured by the thickness metric. We conduct experiments on various network architectures and diverse datasets to prove the superiority of the proposed methods, while the widely accepted Hessian-based curvature smoothing approaches are not as robust as our method.
    Structural State Translation: Condition Transfer between Civil Structures Using Domain-Generalization for Structural Health Monitoring. (arXiv:2212.14048v1 [cs.LG])
    Using Structural Health Monitoring (SHM) systems with extensive sensing arrangements on every civil structure can be costly and impractical. Various concepts have been introduced to alleviate such difficulties, such as Population-based SHM (PBSHM). Nevertheless, the studies presented in the literature do not adequately address the challenge of accessing the information on different structural states (conditions) of dissimilar civil structures. The study herein introduces a novel framework named Structural State Translation (SST), which aims to estimate the response data of different civil structures based on the information obtained from a dissimilar structure. SST can be defined as Translating a state of one civil structure to another state after discovering and learning the domain-invariant representation in the source domains of a dissimilar civil structure. SST employs a Domain-Generalized Cycle-Generative (DGCG) model to learn the domain-invariant representation in the acceleration datasets obtained from a numeric bridge structure that is in two different structural conditions. In other words, the model is tested on three dissimilar numeric bridge models to translate their structural conditions. The evaluation results of SST via Mean Magnitude-Squared Coherence (MMSC) and modal identifiers showed that the translated bridge states (synthetic states) are significantly similar to the real ones. As such, the minimum and maximum average MMSC values of real and translated bridge states are 91.2% and 97.1%, the minimum and the maximum difference in natural frequencies are 5.71% and 0%, and the minimum and maximum Modal Assurance Criterion (MAC) values are 0.998 and 0.870. This study is critical for data scarcity and PBSHM, as it demonstrates that it is possible to obtain data from structures while the structure is actually in a different condition or state.
    A Hypergraph Neural Network Framework for Learning Hyperedge-Dependent Node Embeddings. (arXiv:2212.14077v1 [cs.LG])
    In this work, we introduce a hypergraph representation learning framework called Hypergraph Neural Networks (HNN) that jointly learns hyperedge embeddings along with a set of hyperedge-dependent embeddings for each node in the hypergraph. HNN derives multiple embeddings per node in the hypergraph where each embedding for a node is dependent on a specific hyperedge of that node. Notably, HNN is accurate, data-efficient, flexible with many interchangeable components, and useful for a wide range of hypergraph learning tasks. We evaluate the effectiveness of the HNN framework for hyperedge prediction and hypergraph node classification. We find that HNN achieves an overall mean gain of 7.72% and 11.37% across all baseline models and graphs for hyperedge prediction and hypergraph node classification, respectively.
    Maximizing Use-Case Specificity through Precision Model Tuning. (arXiv:2212.14206v1 [cs.CL])
    Language models have become increasingly popular in recent years for tasks like information retrieval. As use-cases become oriented toward specific domains, fine-tuning becomes default for standard performance. To fine-tune these models for specific tasks and datasets, it is necessary to carefully tune the model's hyperparameters and training techniques. In this paper, we present an in-depth analysis of the performance of four transformer-based language models on the task of biomedical information retrieval. The models we consider are DeepMind's RETRO (7B parameters), GPT-J (6B parameters), GPT-3 (175B parameters), and BLOOM (176B parameters). We compare their performance on the basis of relevance, accuracy, and interpretability, using a large corpus of 480000 research papers on protein structure/function prediction as our dataset. Our findings suggest that smaller models, with <10B parameters and fine-tuned on domain-specific datasets, tend to outperform larger language models on highly specific questions in terms of accuracy, relevancy, and interpretability by a significant margin (+50% on average). However, larger models do provide generally better results on broader prompts.
    Auditing the Imputation Effect on Fairness of Predictive Analytics in Higher Education. (arXiv:2109.07908v2 [cs.CY] UPDATED)
    Colleges and universities use predictive analytics in a variety of ways to increase student success rates. Despite the potential for predictive analytics, two major barriers exist to their adoption in higher education: (a) the lack of democratization in deployment, and (b) the potential to exacerbate inequalities. Education researchers and policymakers encounter numerous challenges in deploying predictive modeling in practice. These challenges present in different steps of modeling including data preparation, model development, and evaluation. Nevertheless, each of these steps can introduce additional bias to the system if not appropriately performed. Most large-scale and nationally representative education data sets suffer from a significant number of incomplete responses from the research participants. While many education-related studies addressed the challenges of missing data, little is known about the impact of handling missing values on the fairness of predictive outcomes in practice. In this paper, we set out to first assess the disparities in predictive modeling outcomes for college-student success, then investigate the impact of imputation techniques on the model performance and fairness using a commonly used set of metrics. We conduct a prospective evaluation to provide a less biased estimation of future performance and fairness than an evaluation of historical data. Our comprehensive analysis of a real large-scale education dataset reveals key insights on modeling disparities and how imputation techniques impact the fairness of the student-success predictive outcome under different testing scenarios. Our results indicate that imputation introduces bias if the testing set follows the historical distribution. However, if the injustice in society is addressed and consequently the upcoming batch of observations is equalized, the model would be less biased.
    Are Deep Image Embedding Clustering Methods Effective for Heterogeneous Tabular Data?. (arXiv:2212.14111v1 [cs.LG])
    Deep learning methods in the literature are invariably benchmarked on image data sets and then assumed to work on all data problems. Unfortunately, architectures designed for image learning are often not ready or optimal for non-image data without considering data-specific learning requirements. In this paper, we take a data-centric view to argue that deep image embedding clustering methods are not equally effective on heterogeneous tabular data sets. This paper performs one of the first studies on deep embedding clustering of seven tabular data sets using six state-of-the-art baseline methods proposed for image data sets. Our results reveal that the traditional clustering of tabular data ranks second out of eight methods and is superior to most deep embedding clustering baselines. Our observation is in line with the recent literature that traditional machine learning of tabular data is still a competitive approach against deep learning. Although surprising to many deep learning researchers, traditional clustering methods can be competitive baselines for tabular data, and outperforming these baselines remains a challenge for deep embedding clustering. Therefore, deep learning methods for image learning may not be fair or suitable baselines for tabular data without considering data-specific contrasts and learning requirements.
    Quantum-Inspired Tensor Neural Networks for Option Pricing. (arXiv:2212.14076v1 [q-fin.PR])
    Recent advances in deep learning have enabled us to address the curse of dimensionality (COD) by solving problems in higher dimensions. A subset of such approaches of addressing the COD has led us to solving high-dimensional PDEs. This has resulted in opening doors to solving a variety of real-world problems ranging from mathematical finance to stochastic control for industrial applications. Although feasible, these deep learning methods are still constrained by training time and memory. Tackling these shortcomings, Tensor Neural Networks (TNN) demonstrate that they can provide significant parameter savings while attaining the same accuracy as compared to the classical Dense Neural Network (DNN). In addition, we also show how TNN can be trained faster than DNN for the same accuracy. Besides TNN, we also introduce Tensor Network Initializer (TNN Init), a weight initialization scheme that leads to faster convergence with smaller variance for an equivalent parameter count as compared to a DNN. We benchmark TNN and TNN Init by applying them to solve the parabolic PDE associated with the Heston model, which is widely used in financial pricing theory.
    Normalizing Flows for Hierarchical Bayesian Analysis: A Gravitational Wave Population Study. (arXiv:2211.09008v3 [astro-ph.IM] UPDATED)
    We propose parameterizing the population distribution of the gravitational wave population modeling framework (Hierarchical Bayesian Analysis) with a normalizing flow. We first demonstrate the merit of this method on illustrative experiments and then analyze four parameters of the latest LIGO/Virgo data release: primary mass, secondary mass, redshift, and effective spin. Our results show that despite the small and notoriously noisy dataset, the posterior predictive distributions (assuming a prior over the parameters of the flow) of the observed gravitational wave population recover structure that agrees with robust previous phenomenological modeling results while being less susceptible to biases introduced by less flexible models. Therefore, the method forms a promising flexible, reliable replacement for population inference distributions, even when data is highly noisy.
    Decentralized Learning with Separable Data: Generalization and Fast Algorithms. (arXiv:2209.07116v3 [cs.LG] UPDATED)
    Decentralized learning offers privacy and communication efficiency when data are naturally distributed among agents communicating over an underlying graph. Motivated by overparameterized learning settings, in which models are trained to zero training loss, we study algorithmic and generalization properties of decentralized learning with gradient descent on separable data. Specifically, for decentralized gradient descent (DGD) and a variety of loss functions that asymptote to zero at infinity (including exponential and logistic losses), we derive novel finite-time generalization bounds. This complements a long line of recent work that studies the generalization performance and the implicit bias of gradient descent over separable data, but has thus far been limited to centralized learning scenarios. Notably, our generalization bounds approximately match in order their centralized counterparts. Critical behind this, and of independent interest, is establishing novel bounds on the training loss and the rate-of-consensus of DGD for a class of self-bounded losses. Finally, on the algorithmic front, we design improved gradient-based routines for decentralized learning with separable data and empirically demonstrate orders-of-magnitude of speed-up in terms of both training and generalization performance.
    Taylor-Lagrange Neural Ordinary Differential Equations: Toward Fast Training and Evaluation of Neural ODEs. (arXiv:2201.05715v2 [cs.LG] UPDATED)
    Neural ordinary differential equations (NODEs) -- parametrizations of differential equations using neural networks -- have shown tremendous promise in learning models of unknown continuous-time dynamical systems from data. However, every forward evaluation of a NODE requires numerical integration of the neural network used to capture the system dynamics, making their training prohibitively expensive. Existing works rely on off-the-shelf adaptive step-size numerical integration schemes, which often require an excessive number of evaluations of the underlying dynamics network to obtain sufficient accuracy for training. By contrast, we accelerate the evaluation and the training of NODEs by proposing a data-driven approach to their numerical integration. The proposed Taylor-Lagrange NODEs (TL-NODEs) use a fixed-order Taylor expansion for numerical integration, while also learning to estimate the expansion's approximation error. As a result, the proposed approach achieves the same accuracy as adaptive step-size schemes while employing only low-order Taylor expansions, thus greatly reducing the computational cost necessary to integrate the NODE. A suite of numerical experiments, including modeling dynamical systems, image classification, and density estimation, demonstrate that TL-NODEs can be trained more than an order of magnitude faster than state-of-the-art approaches, without any loss in performance.
    Search Efficient Binary Network Embedding. (arXiv:1901.04097v2 [cs.SI] UPDATED)
    Traditional network embedding primarily focuses on learning a continuous vector representation for each node, preserving network structure and/or node content information, such that off-the-shelf machine learning algorithms can be easily applied to the vector-format node representations for network analysis. However, the learned continuous vector representations are inefficient for large-scale similarity search, which often involves finding nearest neighbors measured by distance or similarity in a continuous vector space. In this paper, we propose a search efficient binary network embedding algorithm called BinaryNE to learn a binary code for each node, by simultaneously modeling node context relations and node attribute relations through a three-layer neural network. BinaryNE learns binary node representations through a stochastic gradient descent based online learning algorithm. The learned binary encoding not only reduces memory usage to represent each node, but also allows fast bit-wise comparisons to support faster node similarity search than using Euclidean distance or other distance measures. Extensive experiments and comparisons demonstrate that BinaryNE not only delivers more than 25 times faster search speed, but also provides comparable or better search quality than traditional continuous vector based network embedding methods. The binary codes learned by BinaryNE also render competitive performance on node classification and node clustering tasks. The source code of this paper is available at https://github.com/daokunzhang/BinaryNE.
    CGX: Adaptive System Support for Communication-Efficient Deep Learning. (arXiv:2111.08617v5 [cs.DC] UPDATED)
    The ability to scale out training workloads has been one of the key performance enablers of deep learning. The main scaling approach is data-parallel GPU-based training, which has been boosted by hardware and software support for highly efficient point-to-point communication, and in particular via hardware bandwidth overprovisioning. Overprovisioning comes at a cost: there is an order of magnitude price difference between "cloud-grade" servers with such support, relative to their popular "consumer-grade" counterparts, although single server-grade and consumer-grade GPUs can have similar computational envelopes. In this paper, we show that the costly hardware overprovisioning approach can be supplanted via algorithmic and system design, and propose a framework called CGX, which provides efficient software support for compressed communication in ML applications, for both multi-GPU single-node training, as well as larger-scale multi-node training. CGX is based on two technical advances: \emph{At the system level}, it relies on a re-developed communication stack for ML frameworks, which provides flexible, highly-efficient support for compressed communication. \emph{At the application level}, it provides \emph{seamless, parameter-free} integration with popular frameworks, so that end-users do not have to modify training recipes, nor significant training code. This is complemented by a \emph{layer-wise adaptive compression} technique which dynamically balances compression gains with accuracy preservation. CGX integrates with popular ML frameworks, providing up to 3X speedups for multi-GPU nodes based on commodity hardware, and order-of-magnitude improvements in the multi-node setting, with negligible impact on accuracy.
    On the role of surrogates in the efficient estimation of treatment effects with limited outcome data. (arXiv:2003.12408v2 [stat.ML] UPDATED)
    In many investigations, the primary outcome of interest is difficult or expensive to collect. Examples include long-term health effects of medical interventions, measurements requiring expensive testing or follow-up, and outcomes only measurable on small panels as in marketing. This reduces effective sample sizes for estimating the average treatment effect (ATE). However, there is often an abundance of observations on surrogate outcomes not of primary interest, such as short-term health effects or online-ad click-through. We study the role of such surrogate observations in the efficient estimation of treatment effects. To quantify their value, we derive the semiparametric efficiency bounds on ATE estimation with and without the presence of surrogates and several intermediary settings. The difference between these characterizes the efficiency gains from optimally leveraging surrogates. We study two regimes: when the number of surrogate observations is comparable to primary-outcome observations and when the former dominates the latter. We take an agnostic missing-data approach circumventing strong surrogate conditions previously assumed. To leverage surrogates' efficiency gains, we develop efficient ATE estimation and inference based on flexible machine-learning estimates of nuisance functions appearing in the influence functions we derive. We empirically demonstrate the gains by studying the long-term earnings effect of job training.
    Multi-Modal Foundation Model for Simultaneous Comprehension of Molecular Structure and Properties. (arXiv:2211.10590v2 [cs.LG] UPDATED)
    Recently, deep learning approaches have been extensively studied for various problems in chemistry, such as property prediction, virtual screening, de novo molecule design, etc. Despite the impressive successes, separately designed networks for specific tasks are usually required for end-to-end training, so it is often difficult to acquire a unified principle to synergistically combine existing models and training datasets for novel tasks. To address this, here we present a novel multimodal chemical foundation model that can be used for various downstream tasks that require a simultaneous understanding of structure and property. Specifically, inspired by recent advances in pre-trained multi-modal foundation models such as Vision-Language Pretrained models (VLP), we proposed a novel structure-property multi-modal (SPMM) foundation model using the dual-stream transformer with X-shape attention, so that it can align the molecule structure and the chemical properties in a common embedding space. Thanks to the outstanding structure-property unimodal representation, experimental results confirm that SPMM can simultaneously perform molecule generation, property prediction, classification, reaction prediction, etc., which was previously not possible with a single architecture.
    Differentiating Student Feedbacks for Knowledge Tracing. (arXiv:2212.14695v1 [cs.CY])
    In computer-aided education and intelligent tutoring systems, knowledge tracing (KT) raises attention due to the development of data-driven learning methods, which aims to predict students' future performance given their past question response sequences to trace their knowledge states. However, current deep learning approaches only focus on enhancing prediction accuracy, but neglecting the discrimination imbalance of responses. That is, a considerable proportion of question responses are weak to discriminate students' knowledge states, but equally considered compared to other discriminative responses, thus hurting the ability of tracing students' personalized knowledge states. To tackle this issue, we propose DR4KT for Knowledge Tracing, which reweights the contribution of different responses according to their discrimination in training. For retaining high prediction accuracy on low discriminative responses after reweighting, DR4KT also introduces a discrimination-aware score fusion technique to make a proper combination between student knowledge mastery and the questions themselves. Comprehensive experimental results show that our DR4KT applied on four mainstream KT methods significantly improves their performance on three widely-used datasets.
    Ensemble of Loss Functions to Improve Generalizability of Deep Metric Learning methods. (arXiv:2107.01130v2 [cs.CV] UPDATED)
    Deep Metric Learning (DML) learns a non-linear semantic embedding from input data that brings similar pairs together while keeping dissimilar data away from each other. To this end, many different methods are proposed in the last decade with promising results in various applications. The success of a DML algorithm greatly depends on its loss function. However, no loss function is perfect, and it deals only with some aspects of an optimal similarity embedding. Besides, the generalizability of the DML on unseen categories during the test stage is an important matter that is not considered by existing loss functions. To address these challenges, we propose novel approaches to combine different losses built on top of a shared deep feature extractor. The proposed ensemble of losses enforces the deep model to extract features that are consistent with all losses. Since the selected losses are diverse and each emphasizes different aspects of an optimal semantic embedding, our effective combining methods yield a considerable improvement over any individual loss and generalize well on unseen categories. Here, there is no limitation in choosing loss functions, and our methods can work with any set of existing ones. Besides, they can optimize each loss function as well as its weight in an end-to-end paradigm with no need to adjust any hyper-parameter. We evaluate our methods on some popular datasets from the machine vision domain in conventional Zero-Shot-Learning (ZSL) settings. The results are very encouraging and show that our methods outperform all baseline losses by a large margin in all datasets.
    Complementary Calibration: Boosting General Continual Learning with Collaborative Distillation and Self-Supervision. (arXiv:2109.02426v2 [cs.CV] UPDATED)
    General Continual Learning (GCL) aims at learning from non independent and identically distributed stream data without catastrophic forgetting of the old tasks that don't rely on task boundaries during both training and testing stages. We reveal that the relation and feature deviations are crucial problems for catastrophic forgetting, in which relation deviation refers to the deficiency of the relationship among all classes in knowledge distillation, and feature deviation refers to indiscriminative feature representations. To this end, we propose a Complementary Calibration (CoCa) framework by mining the complementary model's outputs and features to alleviate the two deviations in the process of GCL. Specifically, we propose a new collaborative distillation approach for addressing the relation deviation. It distills model's outputs by utilizing ensemble dark knowledge of new model's outputs and reserved outputs, which maintains the performance of old tasks as well as balancing the relationship among all classes. Furthermore, we explore a collaborative self-supervision idea to leverage pretext tasks and supervised contrastive learning for addressing the feature deviation problem by learning complete and discriminative features for all classes. Extensive experiments on four popular datasets show that our CoCa framework achieves superior performance against state-of-the-art methods. Code is available at https://github.com/lijincm/CoCa.
    Reliable Agglomerative Clustering. (arXiv:1901.02063v5 [cs.LG] UPDATED)
    Standard agglomerative clustering suggests establishing a new reliable linkage at every step. However, in order to provide adaptive, density-consistent and flexible solutions, we study extracting all the reliable linkages at each step, instead of the smallest one. Such a strategy can be applied with all common criteria for agglomerative hierarchical clustering. We also study that this strategy with the single linkage criterion yields a minimum spanning tree algorithm. We perform experiments on several real-world datasets to demonstrate the performance of this strategy compared to the standard alternative.
    Guidance Through Surrogate: Towards a Generic Diagnostic Attack. (arXiv:2212.14875v1 [cs.LG])
    Adversarial training is an effective approach to make deep neural networks robust against adversarial attacks. Recently, different adversarial training defenses are proposed that not only maintain a high clean accuracy but also show significant robustness against popular and well studied adversarial attacks such as PGD. High adversarial robustness can also arise if an attack fails to find adversarial gradient directions, a phenomenon known as `gradient masking'. In this work, we analyse the effect of label smoothing on adversarial training as one of the potential causes of gradient masking. We then develop a guided mechanism to avoid local minima during attack optimization, leading to a novel attack dubbed Guided Projected Gradient Attack (G-PGA). Our attack approach is based on a `match and deceive' loss that finds optimal adversarial directions through guidance from a surrogate model. Our modified attack does not require random restarts, large number of attack iterations or search for an optimal step-size. Furthermore, our proposed G-PGA is generic, thus it can be combined with an ensemble attack strategy as we demonstrate for the case of Auto-Attack, leading to efficiency and convergence speed improvements. More than an effective attack, G-PGA can be used as a diagnostic tool to reveal elusive robustness due to gradient masking in adversarial defenses.
    Active Learning of Driving Scenario Trajectories. (arXiv:2108.03217v2 [cs.LG] UPDATED)
    Annotated driving scenario trajectories are crucial for verification and validation of autonomous vehicles. However, annotation of such trajectories based only on explicit rules (i.e. knowledge-based methods) may be prone to errors, such as false positive/negative classification of scenarios that lie on the border of two scenario classes, missing unknown scenario classes, or even failing to detect anomalies. On the other hand, verification of labels by annotators is not cost-efficient. For this purpose, active learning (AL) could potentially improve the annotation procedure by including an annotator/expert in an efficient way. In this study, we develop a generic active learning framework to annotate driving trajectory time series data. We first compute an embedding of the trajectories into a latent space in order to extract the temporal nature of the data. Given such an embedding, the framework becomes task agnostic since active learning can be performed using any classification method and any query strategy, regardless of the structure of the original time series data. Furthermore, we utilize our active learning framework to discover unknown driving scenario trajectories. This will ensure that previously unknown trajectory types can be effectively detected and included in the labeled dataset. We evaluate our proposed framework in different settings on novel real-world datasets consisting of driving trajectories collected by Volvo Cars Corporation. We observe that active learning constitutes an effective tool for labelling driving trajectories as well as for detecting unknown classes. Expectedly, the quality of the embedding plays an important role in the success of the proposed framework.
    Symbolic Visual Reinforcement Learning: A Scalable Framework with Object-Level Abstraction and Differentiable Expression Search. (arXiv:2212.14849v1 [cs.LG])
    Learning efficient and interpretable policies has been a challenging task in reinforcement learning (RL), particularly in the visual RL setting with complex scenes. While neural networks have achieved competitive performance, the resulting policies are often over-parameterized black boxes that are difficult to interpret and deploy efficiently. More recent symbolic RL frameworks have shown that high-level domain-specific programming logic can be designed to handle both policy learning and symbolic planning. However, these approaches rely on coded primitives with little feature learning, and when applied to high-dimensional visual scenes, they can suffer from scalability issues and perform poorly when images have complex object interactions. To address these challenges, we propose \textit{Differentiable Symbolic Expression Search} (DiffSES), a novel symbolic learning approach that discovers discrete symbolic policies using partially differentiable optimization. By using object-level abstractions instead of raw pixel-level inputs, DiffSES is able to leverage the simplicity and scalability advantages of symbolic expressions, while also incorporating the strengths of neural networks for feature learning and optimization. Our experiments demonstrate that DiffSES is able to generate symbolic policies that are simpler and more and scalable than state-of-the-art symbolic RL methods, with a reduced amount of symbolic prior knowledge.
    Essential Number of Principal Components and Nearly Training-Free Model for Spectral Analysis. (arXiv:2212.14623v1 [cs.LG])
    Through a study of multi-gas mixture datasets, we show that in multi-component spectral analysis, the number of functional or non-functional principal components required to retain the essential information is the same as the number of independent constituents in the mixture set. Due to the mutual in-dependency among different gas molecules, near one-to-one projection from the principal component to the mixture constituent can be established, leading to a significant simplification of spectral quantification. Further, with the knowledge of the molar extinction coefficients of each constituent, a complete principal component set can be extracted from the coefficients directly, and few to none training samples are required for the learning model. Compared to other approaches, the proposed methods provide fast and accurate spectral quantification solutions with a small memory size needed.
    Condensed Representation of Machine Learning Data. (arXiv:2212.14229v1 [cs.LG])
    Training of a Machine Learning model requires sufficient data. The sufficiency of the data is not always about the quantity, but about the relevancy and reduced redundancy. Data-generating processes create massive amounts of data. When used raw, such big data is causing much computational resource utilization. Instead of using the raw data, a proper Condensed Representation can be used instead. Combining K-means, a well-known clustering method, with some correction and refinement facilities a novel Condensed Representation method for Machine Learning applications is introduced. To present the novel method meaningfully and visually, synthetically generated data is employed. It has been shown that by using the condensed representation, instead of the raw data, acceptably accurate model training is possible.
    Integral Probability Metrics PAC-Bayes Bounds. (arXiv:2207.00614v8 [stat.ML] UPDATED)
    We present a PAC-Bayes-style generalization bound which enables the replacement of the KL-divergence with a variety of Integral Probability Metrics (IPM). We provide instances of this bound with the IPM being the total variation metric and the Wasserstein distance. A notable feature of the obtained bounds is that they naturally interpolate between classical uniform convergence bounds in the worst case (when the prior and posterior are far away from each other), and improved bounds in favorable cases (when the posterior and prior are close). This illustrates the possibility of reinforcing classical generalization bounds with algorithm- and data-dependent components, thus making them more suitable to analyze algorithms that use a large hypothesis space.
    An Optimal Algorithm for Strongly Convex Min-min Optimization. (arXiv:2212.14439v1 [math.OC])
    In this paper we study the smooth strongly convex minimization problem $\min_{x}\min_y f(x,y)$. The existing optimal first-order methods require $\mathcal{O}(\sqrt{\max\{\kappa_x,\kappa_y\}} \log 1/\epsilon)$ of computations of both $\nabla_x f(x,y)$ and $\nabla_y f(x,y)$, where $\kappa_x$ and $\kappa_y$ are condition numbers with respect to variable blocks $x$ and $y$. We propose a new algorithm that only requires $\mathcal{O}(\sqrt{\kappa_x} \log 1/\epsilon)$ of computations of $\nabla_x f(x,y)$ and $\mathcal{O}(\sqrt{\kappa_y} \log 1/\epsilon)$ computations of $\nabla_y f(x,y)$. In some applications $\kappa_x \gg \kappa_y$, and computation of $\nabla_y f(x,y)$ is significantly cheaper than computation of $\nabla_x f(x,y)$. In this case, our algorithm substantially outperforms the existing state-of-the-art methods.
    Properties of Group Fairness Metrics for Rankings. (arXiv:2212.14351v1 [cs.LG])
    In recent years, several metrics have been developed for evaluating group fairness of rankings. Given that these metrics were developed with different application contexts and ranking algorithms in mind, it is not straightforward which metric to choose for a given scenario. In this paper, we perform a comprehensive comparative analysis of existing group fairness metrics developed in the context of fair ranking. By virtue of their diverse application contexts, we argue that such a comparative analysis is not straightforward. Hence, we take an axiomatic approach whereby we design a set of thirteen properties for group fairness metrics that consider different ranking settings. A metric can then be selected depending on whether it satisfies all or a subset of these properties. We apply these properties on eleven existing group fairness metrics, and through both empirical and theoretical results we demonstrate that most of these metrics only satisfy a small subset of the proposed properties. These findings highlight limitations of existing metrics, and provide insights into how to evaluate and interpret different fairness metrics in practical deployment. The proposed properties can also assist practitioners in selecting appropriate metrics for evaluating fairness in a specific application.
    Invertible normalizing flow neural networks by JKO scheme. (arXiv:2212.14424v1 [stat.ML])
    Normalizing flow is a class of deep generative models for efficient sampling and density estimation. In practice, the flow often appears as a chain of invertible neural network blocks; to facilitate training, existing works have regularized flow trajectories and designed special network architectures. The current paper develops a neural ODE flow network inspired by the Jordan-Kinderleherer-Otto (JKO) scheme, which allows efficient block-wise training of the residual blocks and avoids inner loops of score matching or variational learning. As the JKO scheme unfolds the dynamic of gradient flow, the proposed model naturally stacks residual network blocks one-by-one, reducing the memory load and difficulty of performing end-to-end training of deep flow networks. We also develop adaptive time reparameterization of the flow network with a progressive refinement of the trajectory in probability space, which improves the model training efficiency and accuracy in practice. Using numerical experiments with synthetic and real data, we show that the proposed JKO-iFlow model achieves similar or better performance in generating new samples compared with existing flow and diffusion models at a significantly reduced computational and memory cost.
    Falsification of Learning-Based Controllers through Multi-Fidelity Bayesian Optimization. (arXiv:2212.14118v1 [eess.SY])
    Simulation-based falsification is a practical testing method to increase confidence that the system will meet safety requirements. Because full-fidelity simulations can be computationally demanding, we investigate the use of simulators with different levels of fidelity. As a first step, we express the overall safety specification in terms of environmental parameters and structure this safety specification as an optimization problem. We propose a multi-fidelity falsification framework using Bayesian optimization, which is able to determine at which level of fidelity we should conduct a safety evaluation in addition to finding possible instances from the environment that cause the system to fail. This method allows us to automatically switch between inexpensive, inaccurate information from a low-fidelity simulator and expensive, accurate information from a high-fidelity simulator in a cost-effective way. Our experiments on various environments in simulation demonstrate that multi-fidelity Bayesian optimization has falsification performance comparable to single-fidelity Bayesian optimization but with much lower cost.
    Proof of Swarm Based Ensemble Learning for Federated Learning Applications. (arXiv:2212.14050v1 [cs.LG])
    Ensemble learning combines results from multiple machine learning models in order to provide a better and optimised predictive model with reduced bias, variance and improved predictions. However, in federated learning it is not feasible to apply centralised ensemble learning directly due to privacy concerns. Hence, a mechanism is required to combine results of local models to produce a global model. Most distributed consensus algorithms, such as Byzantine fault tolerance (BFT), do not normally perform well in such applications. This is because, in such methods predictions of some of the peers are disregarded, so a majority of peers can win without even considering other peers' decisions. Additionally, the confidence score of the result of each peer is not normally taken into account, although it is an important feature to consider for ensemble learning. Moreover, the problem of a tie event is often left un-addressed by methods such as BFT. To fill these research gaps, we propose PoSw (Proof of Swarm), a novel distributed consensus algorithm for ensemble learning in a federated setting, which was inspired by particle swarm based algorithms for solving optimisation problems. The proposed algorithm is theoretically proved to always converge in a relatively small number of steps and has mechanisms to resolve tie events while trying to achieve sub-optimum solutions. We experimentally validated the performance of the proposed algorithm using ECG classification as an example application in healthcare, showing that the ensemble learning model outperformed all local models and even the FL-based global model. To the best of our knowledge, the proposed algorithm is the first attempt to make consensus over the output results of distributed models trained using federated learning.
    Power Control for 6G Industrial Wireless Subnetworks: A Graph Neural Network Approach. (arXiv:2212.14051v1 [eess.SP])
    6th Generation (6G) industrial wireless subnetworks are expected to replace wired connectivity for control operation in robots and production modules. Interference management techniques such as centralized power control can improve spectral efficiency in dense deployments of such subnetworks. However, existing solutions for centralized power control may require full channel state information (CSI) of all the desired and interfering links, which may be cumbersome and time-consuming to obtain in dense deployments. This paper presents a novel solution for centralized power control for industrial subnetworks based on Graph Neural Networks (GNNs). The proposed method only requires the subnetwork positioning information, usually known at the central controller, and the knowledge of the desired link channel gain during the execution phase. Simulation results show that our solution achieves similar spectral efficiency as the benchmark schemes requiring full CSI in runtime operations. Also, robustness to changes in the deployment density and environment characteristics with respect to the training phase is verified.  ( 2 min )
    Large-Scale Cell-Level Quality of Service Estimation on 5G Networks Using Machine Learning Techniques. (arXiv:2212.14071v1 [cs.LG])
    This study presents a general machine learning framework to estimate the traffic-measurement-level experience rate at given throughput values in the form of a Key Performance Indicator for the cells on base stations across various cities, using busy-hour counter data, and several technical parameters together with the network topology. Relying on feature engineering techniques, scores of additional predictors are proposed to enhance the effects of raw correlated counter values over the corresponding targets, and to represent the underlying interactions among groups of cells within nearby spatial locations effectively. An end-to-end regression modeling is applied on the transformed data, with results presented on unseen cities of varying sizes.  ( 2 min )
    On Transforming Reinforcement Learning by Transformer: The Development Trajectory. (arXiv:2212.14164v1 [cs.LG])
    Transformer, originally devised for natural language processing, has also attested significant success in computer vision. Thanks to its super expressive power, researchers are investigating ways to deploy transformers to reinforcement learning (RL) and the transformer-based models have manifested their potential in representative RL benchmarks. In this paper, we collect and dissect recent advances on transforming RL by transformer (transformer-based RL or TRL), in order to explore its development trajectory and future trend. We group existing developments in two categories: architecture enhancement and trajectory optimization, and examine the main applications of TRL in robotic manipulation, text-based games, navigation and autonomous driving. For architecture enhancement, these methods consider how to apply the powerful transformer structure to RL problems under the traditional RL framework, which model agents and environments much more precisely than deep RL methods, but they are still limited by the inherent defects of traditional RL algorithms, such as bootstrapping and "deadly triad". For trajectory optimization, these methods treat RL problems as sequence modeling and train a joint state-action model over entire trajectories under the behavior cloning framework, which are able to extract policies from static datasets and fully use the long-sequence modeling capability of the transformer. Given these advancements, extensions and challenges in TRL are reviewed and proposals about future direction are discussed. We hope that this survey can provide a detailed introduction to TRL and motivate future research in this rapidly developing field.  ( 2 min )
    Investigating Sindy As a Tool For Causal Discovery In Time Series Signals. (arXiv:2212.14133v1 [cs.LG])
    The SINDy algorithm has been successfully used to identify the governing equations of dynamical systems from time series data. In this paper, we argue that this makes SINDy a potentially useful tool for causal discovery and that existing tools for causal discovery can be used to dramatically improve the performance of SINDy as tool for robust sparse modeling and system identification. We then demonstrate empirically that augmenting the SINDy algorithm with tools from causal discovery can provides engineers with a tool for learning causally robust governing equations.  ( 2 min )
  • Open

    Invertible normalizing flow neural networks by JKO scheme. (arXiv:2212.14424v1 [stat.ML])
    Normalizing flow is a class of deep generative models for efficient sampling and density estimation. In practice, the flow often appears as a chain of invertible neural network blocks; to facilitate training, existing works have regularized flow trajectories and designed special network architectures. The current paper develops a neural ODE flow network inspired by the Jordan-Kinderleherer-Otto (JKO) scheme, which allows efficient block-wise training of the residual blocks and avoids inner loops of score matching or variational learning. As the JKO scheme unfolds the dynamic of gradient flow, the proposed model naturally stacks residual network blocks one-by-one, reducing the memory load and difficulty of performing end-to-end training of deep flow networks. We also develop adaptive time reparameterization of the flow network with a progressive refinement of the trajectory in probability space, which improves the model training efficiency and accuracy in practice. Using numerical experiments with synthetic and real data, we show that the proposed JKO-iFlow model achieves similar or better performance in generating new samples compared with existing flow and diffusion models at a significantly reduced computational and memory cost.  ( 2 min )
    Gaussian Process Priors for Systems of Linear Partial Differential Equations with Constant Coefficients. (arXiv:2212.14319v1 [stat.ML])
    Partial differential equations (PDEs) are important tools to model physical systems, and including them into machine learning models is an important way of incorporating physical knowledge. Given any system of linear PDEs with constant coefficients, we propose a family of Gaussian process (GP) priors, which we call EPGP, such that all realizations are exact solutions of this system. We apply the Ehrenpreis-Palamodov fundamental principle, which works like a non-linear Fourier transform, to construct GP kernels mirroring standard spectral methods for GPs. Our approach can infer probable solutions of linear PDE systems from any data such as noisy measurements, or initial and boundary conditions. Constructing EPGP-priors is algorithmic, generally applicable, and comes with a sparse version (S-EPGP) that learns the relevant spectral frequencies and works better for big data sets. We demonstrate our approach on three families of systems of PDE, the heat equation, wave equation, and Maxwell's equations, where we improve upon the state of the art in computation time and precision, in some experiments by several orders of magnitude.  ( 2 min )
    Boosting Simple Learners. (arXiv:2001.11704v7 [cs.LG] UPDATED)
    Boosting is a celebrated machine learning approach which is based on the idea of combining weak and moderately inaccurate hypotheses to a strong and accurate one. We study boosting under the assumption that the weak hypotheses belong to a class of bounded capacity. This assumption is inspired by the common convention that weak hypotheses are "rules-of-thumbs" from an "easy-to-learn class". (Schapire and Freund~'12, Shalev-Shwartz and Ben-David '14.) Formally, we assume the class of weak hypotheses has a bounded VC dimension. We focus on two main questions: (i) Oracle Complexity: How many weak hypotheses are needed to produce an accurate hypothesis? We design a novel boosting algorithm and demonstrate that it circumvents a classical lower bound by Freund and Schapire ('95, '12). Whereas the lower bound shows that $\Omega({1}/{\gamma^2})$ weak hypotheses with $\gamma$-margin are sometimes necessary, our new method requires only $\tilde{O}({1}/{\gamma})$ weak hypothesis, provided that they belong to a class of bounded VC dimension. Unlike previous boosting algorithms which aggregate the weak hypotheses by majority votes, the new boosting algorithm uses more complex ("deeper") aggregation rules. We complement this result by showing that complex aggregation rules are in fact necessary to circumvent the aforementioned lower bound. (ii) Expressivity: Which tasks can be learned by boosting weak hypotheses from a bounded VC class? Can complex concepts that are "far away" from the class be learned? Towards answering the first question we {introduce combinatorial-geometric parameters which capture expressivity in boosting.} As a corollary we provide an affirmative answer to the second question for well-studied classes, including half-spaces and decision stumps. Along the way, we establish and exploit connections with Discrepancy Theory.
    Posterior sampling with CNN-based, Plug-and-Play regularization with applications to Post-Stack Seismic Inversion. (arXiv:2212.14595v1 [stat.ML])
    Uncertainty quantification is crucial to inverse problems, as it could provide decision-makers with valuable information about the inversion results. For example, seismic inversion is a notoriously ill-posed inverse problem due to the band-limited and noisy nature of seismic data. It is therefore of paramount importance to quantify the uncertainties associated to the inversion process to ease the subsequent interpretation and decision making processes. Within this framework of reference, sampling from a target posterior provides a fundamental approach to quantifying the uncertainty in seismic inversion. However, selecting appropriate prior information in a probabilistic inversion is crucial, yet non-trivial, as it influences the ability of a sampling-based inference in providing geological realism in the posterior samples. To overcome such limitations, we present a regularized variational inference framework that performs posterior inference by implicitly regularizing the Kullback-Leibler divergence loss with a CNN-based denoiser by means of the Plug-and-Play methods. We call this new algorithm Plug-and-Play Stein Variational Gradient Descent (PnP-SVGD) and demonstrate its ability in producing high-resolution, trustworthy samples representative of the subsurface structures, which we argue could be used for post-inference tasks such as reservoir modelling and history matching. To validate the proposed method, numerical tests are performed on both synthetic and field post-stack seismic data.
    PAC-Bayesian-Like Error Bound for a Class of Linear Time-Invariant Stochastic State-Space Models. (arXiv:2212.14838v1 [stat.ML])
    In this paper we derive a PAC-Bayesian-Like error bound for a class of stochastic dynamical systems with inputs, namely, for linear time-invariant stochastic state-space models (stochastic LTI systems for short). This class of systems is widely used in control engineering and econometrics, in particular, they represent a special case of recurrent neural networks. In this paper we 1) formalize the learning problem for stochastic LTI systems with inputs, 2) derive a PAC-Bayesian-Like error bound for such systems, 3) discuss various consequences of this error bound.
    Heterogeneous Synthetic Learner for Panel Data. (arXiv:2212.14580v1 [stat.ML])
    In the new era of personalization, learning the heterogeneous treatment effect (HTE) becomes an inevitable trend with numerous applications. Yet, most existing HTE estimation methods focus on independently and identically distributed observations and cannot handle the non-stationarity and temporal dependency in the common panel data setting. The treatment evaluators developed for panel data, on the other hand, typically ignore the individualized information. To fill the gap, in this paper, we initialize the study of HTE estimation in panel data. Under different assumptions for HTE identifiability, we propose the corresponding heterogeneous one-side and two-side synthetic learner, namely H1SL and H2SL, by leveraging the state-of-the-art HTE estimator for non-panel data and generalizing the synthetic control method that allows flexible data generating process. We establish the convergence rates of the proposed estimators. The superior performance of the proposed methods over existing ones is demonstrated by extensive numerical studies.
    On Biased Compression for Distributed Learning. (arXiv:2002.12410v3 [cs.LG] UPDATED)
    In the last few years, various communication compression techniques have emerged as an indispensable tool helping to alleviate the communication bottleneck in distributed learning. However, despite the fact {\em biased} compressors often show superior performance in practice when compared to the much more studied and understood {\em unbiased} compressors, very little is known about them. In this work we study three classes of biased compression operators, two of which are new, and their performance when applied to (stochastic) gradient descent and distributed (stochastic) gradient descent. We show for the first time that biased compressors can lead to linear convergence rates both in the single node and distributed settings. We prove that distributed compressed SGD method, employed with error feedback mechanism, enjoys the ergodic rate $\mathcal{O}\left( \delta L \exp[-\frac{\mu K}{\delta L}] + \frac{(C + \delta D)}{K\mu}\right)$, where $\delta\ge1$ is a compression parameter which grows when more compression is applied, $L$ and $\mu$ are the smoothness and strong convexity constants, $C$ captures stochastic gradient noise ($C=0$ if full gradients are computed on each node) and $D$ captures the variance of the gradients at the optimum ($D=0$ for over-parameterized models). Further, via a theoretical study of several synthetic and empirical distributions of communicated gradients, we shed light on why and by how much biased compressors outperform their unbiased variants. Finally, we propose several new biased compressors with promising theoretical guarantees and practical performance.
    Lookback for Learning to Branch. (arXiv:2206.14987v2 [cs.LG] UPDATED)
    The expressive and computationally inexpensive bipartite Graph Neural Networks (GNN) have been shown to be an important component of deep learning based Mixed-Integer Linear Program (MILP) solvers. Recent works have demonstrated the effectiveness of such GNNs in replacing the branching (variable selection) heuristic in branch-and-bound (B&B) solvers. These GNNs are trained, offline and on a collection of MILPs, to imitate a very good but computationally expensive branching heuristic, strong branching. Given that B&B results in a tree of sub-MILPs, we ask (a) whether there are strong dependencies exhibited by the target heuristic among the neighboring nodes of the B&B tree, and (b) if so, whether we can incorporate them in our training procedure. Specifically, we find that with the strong branching heuristic, a child node's best choice was often the parent's second-best choice. We call this the "lookback" phenomenon. Surprisingly, the typical branching GNN of Gasse et al. (2019) often misses this simple "answer". To imitate the target behavior more closely by incorporating the lookback phenomenon in GNNs, we propose two methods: (a) target smoothing for the standard cross-entropy loss function, and (b) adding a Parent-as-Target (PAT) Lookback regularizer term. Finally, we propose a model selection framework to incorporate harder-to-formulate objectives such as solving time in the final models. Through extensive experimentation on standard benchmark instances, we show that our proposal results in up to 22% decrease in the size of the B&B tree and up to 15% improvement in the solving times.
    Functional Integrative Bayesian Analysis of High-dimensional Multiplatform Genomic Data. (arXiv:2212.14165v1 [stat.ME])
    Rapid advancements in collection and dissemination of multi-platform molecular and genomics data has resulted in enormous opportunities to aggregate such data in order to understand, prevent, and treat human diseases. While significant improvements have been made in multi-omic data integration methods to discover biological markers and mechanisms underlying both prognosis and treatment, the precise cellular functions governing these complex mechanisms still need detailed and data-driven de-novo evaluations. We propose a framework called Functional Integrative Bayesian Analysis of High-dimensional Multiplatform Genomic Data (fiBAG), that allows simultaneous identification of upstream functional evidence of proteogenomic biomarkers and the incorporation of such knowledge in Bayesian variable selection models to improve signal detection. fiBAG employs a conflation of Gaussian process models to quantify (possibly non-linear) functional evidence via Bayes factors, which are then mapped to a novel calibrated spike-and-slab prior, thus guiding selection and providing functional relevance to the associations with patient outcomes. Using simulations, we illustrate how integrative methods with functional calibration have higher power to detect disease related markers than non-integrative approaches. We demonstrate the profitability of fiBAG via a pan-cancer analysis of 14 cancer types to identify and assess the cellular mechanisms of proteogenomic markers associated with cancer stemness and patient survival.
    Decoupled Self-supervised Learning for Graphs. (arXiv:2206.03601v2 [cs.LG] UPDATED)
    This paper studies the problem of conducting self-supervised learning for node representation learning on graphs. Most existing self-supervised learning methods assume the graph is homophilous, where linked nodes often belong to the same class or have similar features. However, such assumptions of homophily do not always hold in real-world graphs. We address this problem by developing a decoupled self-supervised learning (DSSL) framework for graph neural networks. DSSL imitates a generative process of nodes and links from latent variable modeling of the semantic structure, which decouples different underlying semantics between different neighborhoods into the self-supervised learning process. Our DSSL framework is agnostic to the encoders and does not need prefabricated augmentations, thus is flexible to different graphs. To effectively optimize the framework, we derive the evidence lower bound of the self-supervised objective and develop a scalable training algorithm with variational inference. We provide a theoretical analysis to justify that DSSL enjoys the better downstream performance. Extensive experiments on various types of graph benchmarks demonstrate that our proposed framework can achieve better performance compared with competitive baselines.
    The Voronoigram: Minimax Estimation of Bounded Variation Functions From Scattered Data. (arXiv:2212.14514v1 [stat.ML])
    We consider the problem of estimating a multivariate function $f_0$ of bounded variation (BV), from noisy observations $y_i = f_0(x_i) + z_i$ made at random design points $x_i \in \mathbb{R}^d$, $i=1,\ldots,n$. We study an estimator that forms the Voronoi diagram of the design points, and then solves an optimization problem that regularizes according to a certain discrete notion of total variation (TV): the sum of weighted absolute differences of parameters $\theta_i,\theta_j$ (which estimate the function values $f_0(x_i),f_0(x_j)$) at all neighboring cells $i,j$ in the Voronoi diagram. This is seen to be equivalent to a variational optimization problem that regularizes according to the usual continuum (measure-theoretic) notion of TV, once we restrict the domain to functions that are piecewise constant over the Voronoi diagram. The regression estimator under consideration hence performs (shrunken) local averaging over adaptively formed unions of Voronoi cells, and we refer to it as the Voronoigram, following the ideas in Koenker (2005), and drawing inspiration from Tukey's regressogram (Tukey, 1961). Our contributions in this paper span both the conceptual and theoretical frontiers: we discuss some of the unique properties of the Voronoigram in comparison to TV-regularized estimators that use other graph-based discretizations; we derive the asymptotic limit of the Voronoi TV functional; and we prove that the Voronoigram is minimax rate optimal (up to log factors) for estimating BV functions that are essentially bounded.
    A Simple Approach to Improve Single-Model Deep Uncertainty via Distance-Awareness. (arXiv:2205.00403v2 [cs.LG] UPDATED)
    Accurate uncertainty quantification is a major challenge in deep learning, as neural networks can make overconfident errors and assign high confidence predictions to out-of-distribution (OOD) inputs. The most popular approaches to estimate predictive uncertainty in deep learning are methods that combine predictions from multiple neural networks, such as Bayesian neural networks (BNNs) and deep ensembles. However their practicality in real-time, industrial-scale applications are limited due to the high memory and computational cost. Furthermore, ensembles and BNNs do not necessarily fix all the issues with the underlying member networks. In this work, we study principled approaches to improve uncertainty property of a single network, based on a single, deterministic representation. By formalizing the uncertainty quantification as a minimax learning problem, we first identify distance awareness, i.e., the model's ability to quantify the distance of a testing example from the training data, as a necessary condition for a DNN to achieve high-quality (i.e., minimax optimal) uncertainty estimation. We then propose Spectral-normalized Neural Gaussian Process (SNGP), a simple method that improves the distance-awareness ability of modern DNNs with two simple changes: (1) applying spectral normalization to hidden weights to enforce bi-Lipschitz smoothness in representations and (2) replacing the last output layer with a Gaussian process layer. On a suite of vision and language understanding benchmarks, SNGP outperforms other single-model approaches in prediction, calibration and out-of-domain detection. Furthermore, SNGP provides complementary benefits to popular techniques such as deep ensembles and data augmentation, making it a simple and scalable building block for probabilistic deep learning. Code is open-sourced at https://github.com/google/uncertainty-baselines
    Integral Probability Metrics PAC-Bayes Bounds. (arXiv:2207.00614v8 [stat.ML] UPDATED)
    We present a PAC-Bayes-style generalization bound which enables the replacement of the KL-divergence with a variety of Integral Probability Metrics (IPM). We provide instances of this bound with the IPM being the total variation metric and the Wasserstein distance. A notable feature of the obtained bounds is that they naturally interpolate between classical uniform convergence bounds in the worst case (when the prior and posterior are far away from each other), and improved bounds in favorable cases (when the posterior and prior are close). This illustrates the possibility of reinforcing classical generalization bounds with algorithm- and data-dependent components, thus making them more suitable to analyze algorithms that use a large hypothesis space.
    INO: Invariant Neural Operators for Learning Complex Physical Systems with Momentum Conservation. (arXiv:2212.14365v1 [cs.LG])
    Neural operators, which emerge as implicit solution operators of hidden governing equations, have recently become popular tools for learning responses of complex real-world physical systems. Nevertheless, the majority of neural operator applications has thus far been data-driven, which neglects the intrinsic preservation of fundamental physical laws in data. In this paper, we introduce a novel integral neural operator architecture, to learn physical models with fundamental conservation laws automatically guaranteed. In particular, by replacing the frame-dependent position information with its invariant counterpart in the kernel space, the proposed neural operator is by design translation- and rotation-invariant, and consequently abides by the conservation laws of linear and angular momentums. As applications, we demonstrate the expressivity and efficacy of our model in learning complex material behaviors from both synthetic and experimental datasets, and show that, by automatically satisfying these essential physical laws, our learned neural operator is not only generalizable in handling translated and rotated datasets, but also achieves state-of-the-art accuracy and efficiency as compared to baseline neural operator models.
    Unsupervised Representation Learning with Minimax Distance Measures. (arXiv:1904.13223v3 [cs.LG] UPDATED)
    We investigate the use of Minimax distances to extract in a nonparametric way the features that capture the unknown underlying patterns and structures in the data. We develop a general-purpose and computationally efficient framework to employ Minimax distances with many machine learning methods that perform on numerical data. We study both computing the pairwise Minimax distances for all pairs of objects and as well as computing the Minimax distances of all the objects to/from a fixed (test) object. We first efficiently compute the pairwise Minimax distances between the objects, using the equivalence of Minimax distances over a graph and over a minimum spanning tree constructed on that. Then, we perform an embedding of the pairwise Minimax distances into a new vector space, such that their squared Euclidean distances in the new space equal to the pairwise Minimax distances in the original space. We also study the case of having multiple pairwise Minimax matrices, instead of a single one. Thereby, we propose an embedding via first summing up the centered matrices and then performing an eigenvalue decomposition to obtain the relevant features. In the following, we study computing Minimax distances from a fixed (test) object which can be used for instance in K-nearest neighbor search. Similar to the case of all-pair pairwise Minimax distances, we develop an efficient and general-purpose algorithm that is applicable with any arbitrary base distance measure. Moreover, we investigate in detail the edges selected by the Minimax distances and thereby explore the ability of Minimax distances in detecting outlier objects. Finally, for each setting, we perform several experiments to demonstrate the effectiveness of our framework.
    Policy Mirror Ascent for Efficient and Independent Learning in Mean Field Games. (arXiv:2212.14449v1 [math.OC])
    Mean-field games have been used as a theoretical tool to obtain an approximate Nash equilibrium for symmetric and anonymous $N$-player games in literature. However, limiting applicability, existing theoretical results assume variations of a "population generative model", which allows arbitrary modifications of the population distribution by the learning algorithm. Instead, we show that $N$ agents running policy mirror ascent converge to the Nash equilibrium of the regularized game within $\tilde{\mathcal{O}}(\varepsilon^{-2})$ samples from a single sample trajectory without a population generative model, up to a standard $\mathcal{O}(\frac{1}{\sqrt{N}})$ error due to the mean field. Taking a divergent approach from literature, instead of working with the best-response map we first show that a policy mirror ascent map can be used to construct a contractive operator having the Nash equilibrium as its fixed point. Next, we prove that conditional TD-learning in $N$-agent games can learn value functions within $\tilde{\mathcal{O}}(\varepsilon^{-2})$ time steps. These results allow proving sample complexity guarantees in the oracle-free setting by only relying on a sample path from the $N$ agent simulator. Furthermore, we demonstrate that our methodology allows for independent learning by $N$ agents with finite sample guarantees.
    Relative Probability on Finite Outcome Spaces: A Systematic Examination of its Axiomatization, Properties, and Applications. (arXiv:2212.14555v1 [stat.ML])
    This work proposes a view of probability as a relative measure rather than an absolute one. To demonstrate this concept, we focus on finite outcome spaces and develop three fundamental axioms that establish requirements for relative probability functions. We then provide a library of examples of these functions and a system for composing them. Additionally, we discuss a relative version of Bayesian inference and its digital implementation. Finally, we prove the topological closure of the relative probability space, highlighting its ability to preserve information under limits.
    Predictor Selection for Synthetic Controls. (arXiv:2203.11576v2 [stat.ME] UPDATED)
    Synthetic control methods often rely on matching pre-treatment characteristics (called predictors) of the treated unit. The choice of predictors and how they are weighted plays a key role in the performance and interpretability of synthetic control estimators. This paper proposes the use of a sparse synthetic control procedure that penalizes the number of predictors used in generating the counterfactual to select the most important predictors. We derive, in a linear factor model framework, a new model selection consistency result and show that the penalized procedure has a faster mean squared error convergence rate. Through a simulation study, we then show that the sparse synthetic control achieves lower bias and has better post-treatment performance than the un-penalized synthetic control. Finally, we apply the method to revisit the study of the passage of Proposition 99 in California in an augmented setting with a large number of predictors available.
    Improving Certified Robustness via Statistical Learning with Logical Reasoning. (arXiv:2003.00120v7 [cs.LG] UPDATED)
    Intensive algorithmic efforts have been made to enable the rapid improvements of certificated robustness for complex ML models recently. However, current robustness certification methods are only able to certify under a limited perturbation radius. Given that existing pure data-driven statistical approaches have reached a bottleneck, in this paper, we propose to integrate statistical ML models with knowledge (expressed as logical rules) as a reasoning component using Markov logic networks (MLN, so as to further improve the overall certified robustness. This opens new research questions about certifying the robustness of such a paradigm, especially the reasoning component (e.g., MLN). As the first step towards understanding these questions, we first prove that the computational complexity of certifying the robustness of MLN is #P-hard. Guided by this hardness result, we then derive the first certified robustness bound for MLN by carefully analyzing different model regimes. Finally, we conduct extensive experiments on five datasets including both high-dimensional images and natural language texts, and we show that the certified robustness with knowledge-based logical reasoning indeed significantly outperforms that of the state-of-the-arts.
    Near-Optimal Non-Parametric Sequential Tests and Confidence Sequences with Possibly Dependent Observations. (arXiv:2212.14411v1 [stat.ME])
    Sequential testing, always-valid $p$-values, and confidence sequences promise flexible statistical inference and on-the-fly decision making. However, unlike fixed-$n$ inference based on asymptotic normality, existing sequential tests either make parametric assumptions and end up under-covering/over-rejecting when these fail or use non-parametric but conservative concentration inequalities and end up over-covering/under-rejecting. To circumvent these issues, we sidestep exact at-least-$\alpha$ coverage and focus on asymptotically exact coverage and asymptotic optimality. That is, we seek sequential tests whose probability of ever rejecting a true hypothesis asymptotically approaches $\alpha$ and whose expected time to reject a false hypothesis approaches a lower bound on all tests with asymptotic coverage at least $\alpha$, both under an appropriate asymptotic regime. We permit observations to be both non-parametric and dependent and focus on testing whether the observations form a martingale difference sequence. We propose the universal sequential probability ratio test (uSPRT), a slight modification to the normal-mixture sequential probability ratio test, where we add a burn-in period and adjust thresholds accordingly. We show that even in this very general setting, the uSPRT is asymptotically optimal under mild generic conditions. We apply the results to stabilized estimating equations to test means, treatment effects, etc. Our results also provide corresponding guarantees for the implied confidence sequences. Numerical simulations verify our guarantees and the benefits of the uSPRT over alternatives.
    How do noise tails impact on deep ReLU networks?. (arXiv:2203.10418v2 [math.ST] UPDATED)
    This paper investigates the stability of deep ReLU neural networks for nonparametric regression under the assumption that the noise has only a finite p-th moment. We unveil how the optimal rate of convergence depends on p, the degree of smoothness and the intrinsic dimension in a class of nonparametric regression functions with hierarchical composition structure when both the adaptive Huber loss and deep ReLU neural networks are used. This optimal rate of convergence cannot be obtained by the ordinary least squares but can be achieved by the Huber loss with a properly chosen parameter that adapts to the sample size, smoothness, and moment parameters. A concentration inequality for the adaptive Huber ReLU neural network estimators with allowable optimization errors is also derived. To establish a matching lower bound within the class of neural network estimators using the Huber loss, we employ a different strategy from the traditional route: constructing a deep ReLU network estimator that has a better empirical loss than the true function and the difference between these two functions furnishes a low bound. This step is related to the Huberization bias, yet more critically to the approximability of deep ReLU networks. As a result, we also contribute some new results on the approximation theory of deep ReLU neural networks.
    On the role of surrogates in the efficient estimation of treatment effects with limited outcome data. (arXiv:2003.12408v2 [stat.ML] UPDATED)
    In many investigations, the primary outcome of interest is difficult or expensive to collect. Examples include long-term health effects of medical interventions, measurements requiring expensive testing or follow-up, and outcomes only measurable on small panels as in marketing. This reduces effective sample sizes for estimating the average treatment effect (ATE). However, there is often an abundance of observations on surrogate outcomes not of primary interest, such as short-term health effects or online-ad click-through. We study the role of such surrogate observations in the efficient estimation of treatment effects. To quantify their value, we derive the semiparametric efficiency bounds on ATE estimation with and without the presence of surrogates and several intermediary settings. The difference between these characterizes the efficiency gains from optimally leveraging surrogates. We study two regimes: when the number of surrogate observations is comparable to primary-outcome observations and when the former dominates the latter. We take an agnostic missing-data approach circumventing strong surrogate conditions previously assumed. To leverage surrogates' efficiency gains, we develop efficient ATE estimation and inference based on flexible machine-learning estimates of nuisance functions appearing in the influence functions we derive. We empirically demonstrate the gains by studying the long-term earnings effect of job training.
    Learning Representations from Dendrograms. (arXiv:1812.09225v4 [cs.LG] UPDATED)
    We propose unsupervised representation learning and feature extraction from dendrograms. The commonly used Minimax distance measures correspond to building a dendrogram with single linkage criterion, with defining specific forms of a level function and a distance function over that. Therefore, we extend this method to arbitrary dendrograms. We develop a generalized framework wherein different distance measures and representations can be inferred from different types of dendrograms, level functions and distance functions. Via an appropriate embedding, we compute a vector-based representation of the inferred distances, in order to enable many numerical machine learning algorithms to employ such distances. Then, to address the model selection problem, we study the aggregation of different dendrogram-based distances respectively in solution space and in representation space in the spirit of deep representations. In the first approach, for example for the clustering problem, we build a graph with positive and negative edge weights according to the consistency of the clustering labels of different objects among different solutions, in the context of ensemble methods. Then, we use an efficient variant of correlation clustering to produce the final clusters. In the second approach, we investigate the combination of different distances and features sequentially in the spirit of multi-layered architectures to obtain the final features. Finally, we demonstrate the effectiveness of our approach via several numerical studies.
    Mixture of von Mises-Fisher distribution with sparse prototypes. (arXiv:2212.14591v1 [cs.LG])
    Mixtures of von Mises-Fisher distributions can be used to cluster data on the unit hypersphere. This is particularly adapted for high-dimensional directional data such as texts. We propose in this article to estimate a von Mises mixture using a l 1 penalized likelihood. This leads to sparse prototypes that improve clustering interpretability. We introduce an expectation-maximisation (EM) algorithm for this estimation and explore the trade-off between the sparsity term and the likelihood one with a path following algorithm. The model's behaviour is studied on simulated data and, we show the advantages of the approach on real data benchmark. We also introduce a new data set on financial reports and exhibit the benefits of our method for exploratory analysis.
    An Entropy-Based Model for Hierarchical Learning. (arXiv:2212.14681v1 [stat.ML])
    Machine learning is the dominant approach to artificial intelligence, through which computers learn from data and experience. In the framework of supervised learning, for a computer to learn from data accurately and efficiently, some auxiliary information about the data distribution and target function should be provided to it through the learning model. This notion of auxiliary information relates to the concept of regularization in statistical learning theory. A common feature among real-world datasets is that data domains are multiscale and target functions are well-behaved and smooth. In this paper, we propose a learning model that exploits this multiscale data structure and discuss its statistical and computational benefits. The hierarchical learning model is inspired by the logical and progressive easy-to-hard learning mechanism of human beings and has interpretable levels. The model apportions computational resources according to the complexity of data instances and target functions. This property can have multiple benefits, including higher inference speed and computational savings in training a model for many users or when training is interrupted. We provide a statistical analysis of the learning mechanism using multiscale entropies and show that it can yield significantly stronger guarantees than uniform convergence bounds.
    Nonparametric Regression with Shallow Overparameterized Neural Networks Trained by GD with Early Stopping. (arXiv:2107.05341v3 [cs.LG] UPDATED)
    We explore the ability of overparameterized shallow neural networks to learn Lipschitz regression functions with and without label noise when trained by Gradient Descent (GD). To avoid the problem that in the presence of noisy labels, neural networks trained to nearly zero training error are inconsistent on this class, we propose an early stopping rule that allows us to show optimal rates. This provides an alternative to the result of Hu et al. (2021) who studied the performance of $\ell 2$ -regularized GD for training shallow networks in nonparametric regression which fully relied on the infinite-width network (Neural Tangent Kernel (NTK)) approximation. Here we present a simpler analysis which is based on a partitioning argument of the input space (as in the case of 1-nearest-neighbor rule) coupled with the fact that trained neural networks are smooth with respect to their inputs when trained by GD. In the noise-free case the proof does not rely on any kernelization and can be regarded as a finite-width result. In the case of label noise, by slightly modifying the proof, the noise is controlled using a technique of Yao, Rosasco, and Caponnetto (2007).
    Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets. (arXiv:2202.07511v3 [cs.LG] UPDATED)
    We study episodic two-player zero-sum Markov games (MGs) in the offline setting, where the goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori. When the dataset does not have uniform coverage over all policy pairs, finding an approximate NE involves challenges in three aspects: (i) distributional shift between the behavior policy and the optimal policy, (ii) function approximation to handle large state space, and (iii) minimax optimization for equilibrium solving. We propose a pessimism-based algorithm, dubbed as pessimistic minimax value iteration (PMVI), which overcomes the distributional shift by constructing pessimistic estimates of the value functions for both players and outputs a policy pair by solving NEs based on the two value functions. Furthermore, we establish a data-dependent upper bound on the suboptimality which recovers a sublinear rate without the assumption on uniform coverage of the dataset. We also prove an information-theoretical lower bound, which suggests that the data-dependent term in the upper bound is intrinsic. Our theoretical results also highlight a notion of "relative uncertainty", which characterizes the necessary and sufficient condition for achieving sample efficiency in offline MGs. To the best of our knowledge, we provide the first nearly minimax optimal result for offline MGs with function approximation.
    Eliminating Meta Optimization Through Self-Referential Meta Learning. (arXiv:2212.14392v1 [cs.LG])
    Meta Learning automates the search for learning algorithms. At the same time, it creates a dependency on human engineering on the meta-level, where meta learning algorithms need to be designed. In this paper, we investigate self-referential meta learning systems that modify themselves without the need for explicit meta optimization. We discuss the relationship of such systems to in-context and memory-based meta learning and show that self-referential neural networks require functionality to be reused in the form of parameter sharing. Finally, we propose fitness monotonic execution (FME), a simple approach to avoid explicit meta optimization. A neural network self-modifies to solve bandit and classic control tasks, improves its self-modifications, and learns how to learn, purely by assigning more computational resources to better performing solutions.
    Do Bayesian Variational Autoencoders Know What They Don't Know?. (arXiv:2212.14272v1 [stat.ML])
    The problem of detecting the Out-of-Distribution (OoD) inputs is of paramount importance for Deep Neural Networks. It has been previously shown that even Deep Generative Models that allow estimating the density of the inputs may not be reliable and often tend to make over-confident predictions for OoDs, assigning to them a higher density than to the in-distribution data. This over-confidence in a single model can be potentially mitigated with Bayesian inference over the model parameters that take into account epistemic uncertainty. This paper investigates three approaches to Bayesian inference: stochastic gradient Markov chain Monte Carlo, Bayes by Backpropagation, and Stochastic Weight Averaging-Gaussian. The inference is implemented over the weights of the deep neural networks that parameterize the likelihood of the Variational Autoencoder. We empirically evaluate the approaches against several benchmarks that are often used for OoD detection: estimation of the marginal likelihood utilizing sampled model ensemble, typicality test, disagreement score, and Watanabe-Akaike Information Criterion. Finally, we introduce two simple scores that demonstrate the state-of-the-art performance.
    Quantizing Heavy-tailed Data in Statistical Estimation: (Near) Minimax Rates, Covariate Quantization, and Uniform Recovery. (arXiv:2212.14562v1 [math.ST])
    This paper studies the quantization of heavy-tailed data in some fundamental statistical estimation problems, where the underlying distributions have bounded moments of some order. We propose to truncate and properly dither the data prior to a uniform quantization. Our major standpoint is that (near) minimax rates of estimation error are achievable merely from the quantized data produced by the proposed scheme. In particular, concrete results are worked out for covariance estimation, compressed sensing, and matrix completion, all agreeing that the quantization only slightly worsens the multiplicative factor. Besides, we study compressed sensing where both covariate (i.e., sensing vector) and response are quantized. Under covariate quantization, although our recovery program is non-convex because the covariance matrix estimator lacks positive semi-definiteness, all local minimizers are proved to enjoy near optimal error bound. Moreover, by the concentration inequality of product process and covering argument, we establish near minimax uniform recovery guarantee for quantized compressed sensing with heavy-tailed noise.
    An Instrumental Variable Approach to Confounded Off-Policy Evaluation. (arXiv:2212.14468v1 [stat.ML])
    Off-policy evaluation (OPE) is a method for estimating the return of a target policy using some pre-collected observational data generated by a potentially different behavior policy. In some cases, there may be unmeasured variables that can confound the action-reward or action-next-state relationships, rendering many existing OPE approaches ineffective. This paper develops an instrumental variable (IV)-based method for consistent OPE in confounded Markov decision processes (MDPs). Similar to single-stage decision making, we show that IV enables us to correctly identify the target policy's value in infinite horizon settings as well. Furthermore, we propose an efficient and robust value estimator and illustrate its effectiveness through extensive simulations and analysis of real data from a world-leading short-video platform.
    Non-intrusive surrogate modelling using sparse random features with applications in crashworthiness analysis. (arXiv:2212.14507v1 [cs.LG])
    Efficient surrogate modelling is a key requirement for uncertainty quantification in data-driven scenarios. In this work, a novel approach of using Sparse Random Features for surrogate modelling in combination with self-supervised dimensionality reduction is described. The method is compared to other methods on synthetic and real data obtained from crashworthiness analyses. The results show a superiority of the here described approach over state of the art surrogate modelling techniques, Polynomial Chaos Expansions and Neural Networks.
    Model-Centric and Data-Centric Aspects of Active Learning for Deep Neural Networks. (arXiv:2009.10835v3 [cs.LG] UPDATED)
    We study different aspects of active learning with deep neural networks in a consistent and unified way. i) We investigate incremental and cumulative training modes which specify how the newly labeled data are used for training. ii) We study active learning w.r.t. the model configurations such as the number of epochs and neurons as well as the choice of batch size. iii) We consider in detail the behavior of query strategies and their corresponding informativeness measures and accordingly propose more efficient querying procedures. iv) We perform statistical analyses, e.g., on actively learned classes and test error estimation, that reveal several insights about active learning. v) We investigate how active learning with neural networks can benefit from pseudo-labels as proxies for actual labels.
    Reducing Certified Regression to Certified Classification for General Poisoning Attacks. (arXiv:2208.13904v2 [cs.LG] UPDATED)
    Adversarial training instances can severely distort a model's behavior. This work investigates certified regression defenses, which provide guaranteed limits on how much a regressor's prediction may change under a poisoning attack. Our key insight is that certified regression reduces to voting-based certified classification when using median as a model's primary decision function. Coupling our reduction with existing certified classifiers, we propose six new regressors provably-robust to poisoning attacks. To the extent of our knowledge, this is the first work that certifies the robustness of individual regression predictions without any assumptions about the data distribution and model architecture. We also show that the assumptions made by existing state-of-the-art certified classifiers are often overly pessimistic. We introduce a tighter analysis of model robustness, which in many cases results in significantly improved certified guarantees. Lastly, we empirically demonstrate our approaches' effectiveness on both regression and classification data, where the accuracy of up to 50% of test predictions can be guaranteed under 1% training set corruption and up to 30% of predictions under 4% corruption. Our source code is available at https://github.com/ZaydH/certified-regression.
    Can Direct Latent Model Learning Solve Linear Quadratic Gaussian Control?. (arXiv:2212.14511v1 [cs.LG])
    We study the task of learning state representations from potentially high-dimensional observations, with the goal of controlling an unknown partially observable system. We pursue a direct latent model learning approach, where a dynamic model in some latent state space is learned by predicting quantities directly related to planning (e.g., costs) without reconstructing the observations. In particular, we focus on an intuitive cost-driven state representation learning method for solving Linear Quadratic Gaussian (LQG) control, one of the most fundamental partially observable control problems. As our main results, we establish finite-sample guarantees of finding a near-optimal state representation function and a near-optimal controller using the directly learned latent model. To the best of our knowledge, despite various empirical successes, prior to this work it was unclear if such a cost-driven latent model learner enjoys finite-sample guarantees. Our work underscores the value of predicting multi-step costs, an idea that is key to our theory, and notably also an idea that is known to be empirically valuable for learning state representations.
    Langevin algorithms for very deep Neural Networks with application to image classification. (arXiv:2212.14718v1 [cs.LG])
    Training a very deep neural network is a challenging task, as the deeper a neural network is, the more non-linear it is. We compare the performances of various preconditioned Langevin algorithms with their non-Langevin counterparts for the training of neural networks of increasing depth. For shallow neural networks, Langevin algorithms do not lead to any improvement, however the deeper the network is and the greater are the gains provided by Langevin algorithms. Adding noise to the gradient descent allows to escape from local traps, which are more frequent for very deep neural networks. Following this heuristic we introduce a new Langevin algorithm called Layer Langevin, which consists in adding Langevin noise only to the weights associated to the deepest layers. We then prove the benefits of Langevin and Layer Langevin algorithms for the training of popular deep residual architectures for image classification.  ( 2 min )
    Resampling Sensitivity of High-Dimensional PCA. (arXiv:2212.14531v1 [math.ST])
    The study of stability and sensitivity of statistical methods or algorithms with respect to their data is an important problem in machine learning and statistics. The performance of the algorithm under resampling of the data is a fundamental way to measure its stability and is closely related to generalization or privacy of the algorithm. In this paper, we study the resampling sensitivity for the principal component analysis (PCA). Given an $ n \times p $ random matrix $ \mathbf{X} $, let $ \mathbf{X}^{[k]} $ be the matrix obtained from $ \mathbf{X} $ by resampling $ k $ randomly chosen entries of $ \mathbf{X} $. Let $ \mathbf{v} $ and $ \mathbf{v}^{[k]} $ denote the principal components of $ \mathbf{X} $ and $ \mathbf{X}^{[k]} $. In the proportional growth regime $ p/n \to \xi \in (0,1] $, we establish the sharp threshold for the sensitivity/stability transition of PCA. When $ k \gg n^{5/3} $, the principal components $ \mathbf{v} $ and $ \mathbf{v}^{[k]} $ are asymptotically orthogonal. On the other hand, when $ k \ll n^{5/3} $, the principal components $ \mathbf{v} $ and $ \mathbf{v}^{[k]} $ are asymptotically colinear. In words, we show that PCA is sensitive to the input data in the sense that resampling even a negligible portion of the input may completely change the output.  ( 2 min )
    Comparative Analysis of Clustering Techniques for Personalized Food Kit Distribution. (arXiv:2212.14874v1 [cs.LG])
    The Government of Kerala had increased the frequency of supply of free food kits owing to the pandemic, however, these items were static and not indicative of the personal preferences of the consumers. This paper conducts a comparative analysis of various clustering techniques on a scaled-down version of a real-world dataset obtained through a conjoint analysis-based survey. Clustering carried out by centroid-based methods such as k means is analyzed and the results are plotted along with SVD, and finally, a conclusion is reached as to which among the two is better. Once the clusters have been formulated, commodities are also decided upon for each cluster. Also, clustering is further enhanced by reassignment, based on a specific cluster loss threshold. Thus, the most efficacious clustering technique for designing a food kit tailored to the needs of individuals is finally obtained.  ( 2 min )
    A Data-Adaptive Prior for Bayesian Learning of Kernels in Operators. (arXiv:2212.14163v1 [stat.ML])
    Kernels are efficient in representing nonlocal dependence and they are widely used to design operators between function spaces. Thus, learning kernels in operators from data is an inverse problem of general interest. Due to the nonlocal dependence, the inverse problem can be severely ill-posed with a data-dependent singular inversion operator. The Bayesian approach overcomes the ill-posedness through a non-degenerate prior. However, a fixed non-degenerate prior leads to a divergent posterior mean when the observation noise becomes small, if the data induces a perturbation in the eigenspace of zero eigenvalues of the inversion operator. We introduce a data-adaptive prior to achieve a stable posterior whose mean always has a small noise limit. The data-adaptive prior's covariance is the inversion operator with a hyper-parameter selected adaptive to data by the L-curve method. Furthermore, we provide a detailed analysis on the computational practice of the data-adaptive prior, and demonstrate it on Toeplitz matrices and integral operators. Numerical tests show that a fixed prior can lead to a divergent posterior mean in the presence of any of the four types of errors: discretization error, model error, partial observation and wrong noise assumption. In contrast, the data-adaptive prior always attains posterior means with small noise limits.  ( 2 min )
    On Undersmoothing and Sample Splitting for Estimating a Doubly Robust Functional. (arXiv:2212.14857v1 [math.ST])
    We consider the problem of constructing minimax rate-optimal estimators for a doubly robust nonparametric functional that has witnessed applications across the causal inference and conditional independence testing literature. Minimax rate-optimal estimators for such functionals are typically constructed through higher-order bias corrections of plug-in and one-step type estimators and, in turn, depend on estimators of nuisance functions. In this paper, we consider a parallel question of interest regarding the optimality and/or sub-optimality of plug-in and one-step bias-corrected estimators for the specific doubly robust functional of interest. Specifically, we verify that by using undersmoothing and sample splitting techniques when constructing nuisance function estimators, one can achieve minimax rates of convergence in all H\"older smoothness classes of the nuisance functions (i.e. the propensity score and outcome regression) provided that the marginal density of the covariates is sufficiently regular. Additionally, by demonstrating suitable lower bounds on these classes of estimators, we demonstrate the necessity to undersmooth the nuisance function estimators to obtain minimax optimal rates of convergence.  ( 2 min )
    Optimal Best Arm Identification in Two-Armed Bandits with a Fixed Budget under a Small Gap. (arXiv:2201.04469v8 [stat.ML] UPDATED)
    We consider fixed-budget best-arm identification in two-armed Gaussian bandit problems. One of the longstanding open questions is the existence of an optimal strategy under which the probability of misidentification matches a lower bound. We show that a strategy following the Neyman allocation rule (Neyman, 1934) is asymptotically optimal when the gap between the expected rewards is small. First, we review a lower bound derived by Kaufmann et al. (2016). Then, we propose the "Neyman Allocation (NA)-Augmented Inverse Probability weighting (AIPW)" strategy, which consists of the sampling rule using the Neyman allocation with an estimated standard deviation and the recommendation rule using an AIPW estimator. Our proposed strategy is optimal because the upper bound matches the lower bound when the budget goes to infinity and the gap goes to zero.  ( 2 min )
    Theoretical Guarantees for Sparse Principal Component Analysis based on the Elastic Net. (arXiv:2212.14194v1 [math.ST])
    Sparse principal component analysis (SPCA) has been widely used for dimensionality reduction and feature extraction in high-dimensional data analysis. Despite there are many methodological and theoretical developments in the past two decades, the theoretical guarantees of the popular SPCA algorithm proposed by Zou, Hastie & Tibshirani (2006) based on the elastic net are still unknown. We aim to close this important theoretical gap in this paper. We first revisit the SPCA algorithm of Zou et al. (2006) and present our implementation. Also, we study a computationally more efficient variant of the SPCA algorithm in Zou et al. (2006) that can be considered as the limiting case of SPCA. We provide the guarantees of convergence to a stationary point for both algorithms. We prove that, under a sparse spiked covariance model, both algorithms can recover the principal subspace consistently under mild regularity conditions. We show that their estimation error bounds match the best available bounds of existing works or the minimax rates up to some logarithmic factors. Moreover, we demonstrate the numerical performance of both algorithms in simulation studies.  ( 2 min )
    Quantile Off-Policy Evaluation via Deep Conditional Generative Learning. (arXiv:2212.14466v1 [stat.ML])
    Off-Policy evaluation (OPE) is concerned with evaluating a new target policy using offline data generated by a potentially different behavior policy. It is critical in a number of sequential decision making problems ranging from healthcare to technology industries. Most of the work in existing literature is focused on evaluating the mean outcome of a given policy, and ignores the variability of the outcome. However, in a variety of applications, criteria other than the mean may be more sensible. For example, when the reward distribution is skewed and asymmetric, quantile-based metrics are often preferred for their robustness. In this paper, we propose a doubly-robust inference procedure for quantile OPE in sequential decision making and study its asymptotic properties. In particular, we propose utilizing state-of-the-art deep conditional generative learning methods to handle parameter-dependent nuisance function estimation. We demonstrate the advantages of this proposed estimator through both simulations and a real-world dataset from a short-video platform. In particular, we find that our proposed estimator outperforms classical OPE estimators for the mean in settings with heavy-tailed reward distributions.  ( 2 min )
    Bayesian Interpolation with Deep Linear Networks. (arXiv:2212.14457v1 [stat.ML])
    This article concerns Bayesian inference using deep linear networks with output dimension one. In the interpolating (zero noise) regime we show that with Gaussian weight priors and MSE negative log-likelihood loss both the predictive posterior and the Bayesian model evidence can be written in closed form in terms of a class of meromorphic special functions called Meijer-G functions. These results are non-asymptotic and hold for any training dataset, network depth, and hidden layer widths, giving exact solutions to Bayesian interpolation using a deep Gaussian process with a Euclidean covariance at each layer. Through novel asymptotic expansions of Meijer-G functions, a rich new picture of the role of depth emerges. Specifically, we find: ${\bf \text{The role of depth in extrapolation}}$: The posteriors in deep linear networks with data-independent priors are the same as in shallow networks with evidence maximizing data-dependent priors. In this sense, deep linear networks make provably optimal predictions. ${\bf \text{The role of depth in model selection}}$: Starting from data-agnostic priors, Bayesian model evidence in wide networks is only maximized at infinite depth. This gives a principled reason to prefer deeper networks (at least in the linear case). ${\bf \text{Scaling laws relating depth, width, and number of datapoints}}$: With data-agnostic priors, a novel notion of effective depth given by \[\#\text{hidden layers}\times\frac{\#\text{training data}}{\text{network width}}\] determines the Bayesian posterior in wide linear networks, giving rigorous new scaling laws for generalization error.  ( 2 min )
    Choosing the Number of Topics in LDA Models -- A Monte Carlo Comparison of Selection Criteria. (arXiv:2212.14074v1 [cs.CL])
    Selecting the number of topics in LDA models is considered to be a difficult task, for which alternative approaches have been proposed. The performance of the recently developed singular Bayesian information criterion (sBIC) is evaluated and compared to the performance of alternative model selection criteria. The sBIC is a generalization of the standard BIC that can be implemented to singular statistical models. The comparison is based on Monte Carlo simulations and carried out for several alternative settings, varying with respect to the number of topics, the number of documents and the size of documents in the corpora. Performance is measured using different criteria which take into account the correct number of topics, but also whether the relevant topics from the DGPs are identified. Practical recommendations for LDA model selection in applications are derived.  ( 2 min )
    Reliable Agglomerative Clustering. (arXiv:1901.02063v5 [cs.LG] UPDATED)
    Standard agglomerative clustering suggests establishing a new reliable linkage at every step. However, in order to provide adaptive, density-consistent and flexible solutions, we study extracting all the reliable linkages at each step, instead of the smallest one. Such a strategy can be applied with all common criteria for agglomerative hierarchical clustering. We also study that this strategy with the single linkage criterion yields a minimum spanning tree algorithm. We perform experiments on several real-world datasets to demonstrate the performance of this strategy compared to the standard alternative.  ( 2 min )
    Robust Bayesian Subspace Identification for Small Data Sets. (arXiv:2212.14132v1 [eess.SY])
    Model estimates obtained from traditional subspace identification methods may be subject to significant variance. This elevated variance is aggravated in the cases of large models or of a limited sample size. Common solutions to reduce the effect of variance are regularized estimators, shrinkage estimators and Bayesian estimation. In the current work we investigate the latter two solutions, which have not yet been applied to subspace identification. Our experimental results show that our proposed estimators may reduce the estimation risk up to $40\%$ of that of traditional subspace methods.  ( 2 min )
    Essential Number of Principal Components and Nearly Training-Free Model for Spectral Analysis. (arXiv:2212.14623v1 [cs.LG])
    Through a study of multi-gas mixture datasets, we show that in multi-component spectral analysis, the number of functional or non-functional principal components required to retain the essential information is the same as the number of independent constituents in the mixture set. Due to the mutual in-dependency among different gas molecules, near one-to-one projection from the principal component to the mixture constituent can be established, leading to a significant simplification of spectral quantification. Further, with the knowledge of the molar extinction coefficients of each constituent, a complete principal component set can be extracted from the coefficients directly, and few to none training samples are required for the learning model. Compared to other approaches, the proposed methods provide fast and accurate spectral quantification solutions with a small memory size needed.  ( 2 min )
    Online Statistical Inference for Contextual Bandits via Stochastic Gradient Descent. (arXiv:2212.14883v1 [stat.ML])
    With the fast development of big data, it has been easier than before to learn the optimal decision rule by updating the decision rule recursively and making online decisions. We study the online statistical inference of model parameters in a contextual bandit framework of sequential decision-making. We propose a general framework for online and adaptive data collection environment that can update decision rules via weighted stochastic gradient descent. We allow different weighting schemes of the stochastic gradient and establish the asymptotic normality of the parameter estimator. Our proposed estimator significantly improves the asymptotic efficiency over the previous averaged SGD approach via inverse probability weights. We also conduct an optimality analysis on the weights in a linear regression setting. We provide a Bahadur representation of the proposed estimator and show that the remainder term in the Bahadur representation entails a slower convergence rate compared to classical SGD due to the adaptive data collection.  ( 2 min )

  • Open

    [P] Hierarchical model test
    I am trying to build a hierarchial model, for this I am training a hierarchical tree and each node will have a model for that specific hierarchy. My question is: how will I choose the best model for each node? My idea was to test using the fold I split (I'm using k-fold), but what data will I use to test the final tree later? submitted by /u/gr_ferro [link] [comments]  ( 61 min )
    [D] Tools to find largest clusters in vector database?
    I'm new to embedding and vector databases. I had this initial thought that I'd be able to provide some sort of parameter that was used as a threshold to determine the cutoff between what a large cluster and what isn't, and that I could just get back a summary of those top clusters and try to then sift through them to figure out what each cluster represented. The more I dig in, I'm not really able t find a way to do this. Seems like vector databases, like Pinecone, expect you to provide it some query vectors to get what you want. Are there any tools to help me easily grab the largest clusters in my vector database versus me seeking them out? submitted by /u/joe_04_04 [link] [comments]  ( 63 min )
    [D] life advice to relatively late bloomer ML theory researcher.
    Hi ML Community out there, I'm 27 M. I'm a machine learning incoming PhD student based out of Germany. I have been trying to get into a PhD program since last 4 years (since 2018). When I gave up and got into a masters instead with 3yoe. Now I'm about to complete my masters thesis and start a PhD. I already have mixed feelings about my PhD journey now. I have gotten a really good opportunity to do ML theory at University of Saarlandes in Germany. Initially, when I had started, I was really driven to understand maths and underlying concepts of ML. However, more recently I have become more confused about the value of theoretical research and it's "real" impact. Of what skills I would have in my career later that would make me desirable to employers (considering I may not get tenured in academia). Is it more advisable in long run to stay and get your PhD or just leave and join some ML role in industry that takes Masters guys and get some real world experience on how to use ML to generate business value ? There is this added fomo of being 27 and just starting my PhD where I lag in the advantage of age to take high risk bets on things. Doing a PhD in theoretical ML could possibly mean I am very likely to be only employable in select places and this gives me a fear of trying to reinvent myself in my mid 30s. Any suggestions on pros and cons of ML research in academia vs a ML industry job of a masters grad would be really helpful! submitted by /u/notyourregularnerd [link] [comments]  ( 65 min )
    [P] Using machine learning to correct geometrical distortion in images
    Hello, I am a phd student working on a project that revolves around the correction of geometrical distortions on images, more specifically the goal is to correct cylindrical distortions in QR Codes in order to improve decoding sucess rate. So far I implemented traditional methods and got interesting results, but I'm now interested in using machine learning to tackle this problem and since I'm still relatively new to machine learning I would like to hear your feedback/opinions on the subject aswell as sugestions on reading material to start. So far from my limited research on the matter, I believe a generative adversarial network would probably be the right choice for this problem, but again I'm not sure and I'm really open to all sugestions/ideas. submitted by /u/LordChips4 [link] [comments]  ( 62 min )
    [Discussion] Is there any open-source alternative to voice.ai ? Looking for open-source speech to speech AI
    Hello everyone ! I recently heard about Voice.AI, but I hate that it's behind a paywall and not open. Are there any open-source alternatives to it ? Thanks ! submitted by /u/FoxTrotte [link] [comments]  ( 61 min )
    [D] Updated model comparison techniques review
    Anyone knows an updated review of model comparison techniques, like the review that we can see in this one of Raschka: Model Evaluation, Model Selection, and Algorithm Selection in Machine Learningor this other one from Dietterich: Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Thanks a lot in advance. submitted by /u/Party-Worldliness-72 [link] [comments]  ( 61 min )
    [P] An old fashioned statistician is looking for other ways to analyse survival data - Is machine learning an option?
    So I will directly dive into the problem setting and will later describe the background: data: survival data of ~3000 patients with several clinical and lab(blood) parameters. question: does 1 of the parameters has any influence on the survival time? what I have done so far: non proportional multivariate hazard model (Cox regression) problem: highly correlated variables, strong time interaction, some hardly not normal distributed variables (even after transformation) QUESTION: Is there a machine learning / AI solution for this problem question? background: I am a PhD student in medicine and did intensive mathematics together with my colleagues. But we only had one „old-fashioned“ statistics professor, who answered us for our problem: „seems like your data isn‘t good enough, and you can‘t explore something there, cause its far too complex“ We first want to get an intuition if our theoretical findings can be proved in the data, before we will plan a new study. I reformulated our problem a bit, we are not dealing with the death of patients, but with time of an specific event. I am really grateful for any ideas, any sources where to look at and everything which could help 😊😊 Thanks in advance! submitted by /u/lattecoffeegirl [link] [comments]  ( 64 min )
    [P] Can I have a prediction consider specific features?
    How do I increase my linear regression model’s ability to consider a feature into the results? I have two features that both contain the desired outputs from the prediction and I would like to somehow get them to overlap their results. When I check for corr they do not share the same correlating features and when I predict X on each their results are almost contrasting even though predicting ok feature 1 produces more accurate results. submitted by /u/Mustang1011 [link] [comments]  ( 67 min )
    [P] Live Coding Tutorial - Diffusion Models from Scratch
    Hello, community! I invite you to take a look at the live coding tutorial video on Diffusion Models Link to the video Content covered: - Theoretical background - Implementation of forward diffusion process - Implementation of the training loop - Overfitting to one batch - Implementation of the reverse diffusion process - Training on CIFAR10 dataset (with class label conditioning) submitted by /u/dtransposed [link] [comments]  ( 60 min )
    [D] Questions about IWAE (Importance Weighted Autoencoders)
    I'm new in this field and just see the paper Importance Weighted Autoencoders (https://arxiv.org/abs/1509.00519). While reading the paper I have 2 questions and I think a lot and also googled, but cannot find any clues or answers so it would be great if someone can help! ​ Why the second term q(h|x) encourages the encoder to have a spread-out distribution? To maximize the objective or minimize the loss, the encoder q(h|x) will be trained to be small. I don't know how this could be connected to spread-out distribution ​ https://preview.redd.it/81ykicq3vm9a1.png?width=1796&format=png&auto=webp&s=f81581013e93286667be44df53b51ecbc77e46a6 How to measure activity of latent dimension? I roughly know about covariance, so this is the first time about the notation of Covariance_{x} and I don't understand how to calculate it. I can understand E_{u~q(u|x)}(u) which is expectation of specific latent dimension given input x. But what do I have to do next? I thought about the concept 'distribution to change depending on the observation', measuring the variance of specific latent dimension of all the input x but it's notation is not right. How can I calculate this one? ​ https://preview.redd.it/ltxxm451vm9a1.png?width=1770&format=png&auto=webp&s=19c8466b8b6c12b7391698daa6347f6786b249a5 submitted by /u/ML_Newbie95 [link] [comments]  ( 65 min )
    [D] RFC-0030: Proposal of fp8 dtype introduction to PyTorch
    Today a PR opened to Pytorch to formally introduce the FP8 data type. Current text: Proposal of fp8 dtype introduction to PyTorch PR: https://github.com/pytorch/rfcs/pull/51 Summary More and more companies working on Deep Learning accelerators are experimenting with 8-bit floating point numbers usage in training and inference. Results of these experiments are presented in many papers published in the last few years. Since fp8 data type seems to be a natural evolution of currently used fp16/bf16, to reduce computation of big DL models, it’s worth to standardize this type. Few attempts of this were done recently: Nvidia, Arm and Intel - https://arxiv.org/pdf/2209.05433.pdf GraphCore and AMD - https://arxiv.org/pdf/2206.02915.pdf Tesla - https://tesla-cdn.thron.com/static/MXMU3S_tesla-dojo-technology_1WDVZN.pdf This RFC proposes adding two 8-bit floating point data types variants to PyTorch, based on the Nvidia/Arm/Intel paper. It’s important to consider these two variants, because they’re already known to be used by Nvidia H100 and Intel Gaudi2 accelerators. submitted by /u/Balance- [link] [comments]  ( 62 min )
    [D] What do you do while you wait for training?
    I was wondering what each individual does while they wait for the model to train because thats what I am waiting for (ETA 5 DAYS) submitted by /u/hollow_sets [link] [comments]  ( 64 min )
    [R] Pyramid adversarial attack with PyTorch
    Here is a PyTorch reproduced version of PyramidAT. Repository link: https://github.com/kdhht2334/Pyramid_AT (Original) Paper link: https://arxiv.org/abs/2111.15121 submitted by /u/Beginning_Distance38 [link] [comments]  ( 61 min )
    [P] Language Model that works on any given body of information
    I need a model that can answer questions regarding a given text (1-2 pages long) in a conversational manner (e.g. ChatGPT) with no additional training/fine-tuning for each different text. In many LLMs, an easy way to achieve this is to add the given text as a "header" in the initial prompt (e.g. as ChatGPT authors do). However, I guess there's a limit on the length of such a prompt, after which model performance degrades. Of course, if there's a model that can handle prompt "headers" of such length that would work fine for me. Can you point me to anything related to this task? submitted by /u/fedetask [link] [comments]  ( 60 min )
    [D] Models that extract action items from conversation
    I am interested in the literature or seeing some examples of models capable of extracting discrete action items from text conversations. Action items can be making an appointment, starting a call, etc. For example, Gmail is able to recognize when people are arranging an appointment in an email chat and it suggests that to be added to the calendar (action item) What type of models perform such tasks? Could you point me to the relevant literature? submitted by /u/fedetask [link] [comments]  ( 62 min )
    [D] Machine Learning Illustrations
    Hey guys! I recently published a website containing machine-learning illustrations! They aim to be "that" resource on which you can rely whenever you need to brush-up on these topics (e.g. technical interviews, exams, or whatever). https://illustrated-machine-learning.github.io/ Other than just spamming the link, I was curious to receive some feedbacks about the website and the illustrations. 😁 I really struggled to find a valid single resource anytime I need something across different topics, therefore my ambition is to create it! Update: Examples of illustrations are: svm: https://illustrated-machine-learning.github.io/pages/machine-learning/linear-algorithms.html#support-vector-machines bias-variance: https://illustrated-machine-learning.github.io/pages/machine-learning/bias-variance.html decision-tree: https://illustrated-machine-learning.github.io/pages/machine-learning/decision-tree.html Update 2.0 Thank you for your feedbacks! I already (brutally) solved your main concern about the list of contests not visible. I will spend more time on it as soon as I can, in order to make it smoother and better! Update 3.0 Prooobably I finally understood your problems with the sidebar button, and I tried to make more visible by making it darker and bigger. Is that better now? submitted by /u/fdis_ [link] [comments]  ( 67 min )
    [D] Can I use ML/AI to read the back panels of electronic components?
    Hi. The recent incredible improvements in AI and ML has resurrected an old project of mine to read the back panel of electronic components like this AV receiver and spit out logically formed text to describe each I/O. I have a LOT of direct experience this specific issue as well as general software experience but no AI/ML development experience. I know it is possible but on a scale of 1-10 how hard? Any new tools make this easier? Ultimately I want to feed the AI pictures of electronic back panels and get formatted text back. Thanks! submitted by /u/UberStone [link] [comments]  ( 63 min )
    [D] What are good ways of incorporating non-sequential context into a transformer model?
    The classic way of incorporating sequential context into a transformer model is to make an encoder-decoder transformer, where the context is processed by the encoder component, and then read into the decoder blocks via cross-attention. However, there are a number of situations where you might want to incorporate non-sequential context into a model - for example, you might want a language model that can generate text conditioned on some input vector describing the person whose text you are trying to emulate, or you might want to condition on some single vector that summarizes all text that occurred prior to the context window. What are standard ways of incorporating such context? I'd also be interested in standard ways of incorporating such context in other sequence models like RNNs with LSTM blocks. submitted by /u/abc220022 [link] [comments]  ( 64 min )
  • Open

    What Exactly Is Chat GPT & How Does It Work? For a 5-Year-Old!
    If you’re reading this, chances are you’re curious about what Chat GPT (Generative Pre-trained Transformer) is and how it works. Maybe…  ( 7 min )
    Soft Skills for Data Scientists
    Data Science is most taught through programming nowadays, however, in real practice, just good coding skills are not enough to convince…  ( 9 min )
    Benefits of a Voicebot to Insurance Companies
    The insurance sector is adopting new technologies at a rapid pace, with many companies implementing new technologies to improve their…  ( 10 min )
    How is AI in Underwriting Poised to Transform the Insurance Industry?
    We are a leading chatbot & enterprise software development services provider in India. We provide end to end RPA services along with AI…  ( 11 min )
  • Open

    Ai Stories: Tips For Naturally Engaging In Conversation With Women (Or Anyone)
    Here are some tips on how to naturally engage with women (or anyone), that's generated by an Ai. What you think? https://www.youtube.com/watch?v=-nL7FFGDuTc https://preview.redd.it/ya7320106p9a1.png?width=863&format=png&auto=webp&s=b494b0d7f3bcb32731b5b93c0d8ac1ff268d9531 View Poll submitted by /u/Swelippo5 [link] [comments]  ( 56 min )
    I created physical collage artwork using MidJourney generated imagery
    submitted by /u/Impressive_Use_5212 [link] [comments]  ( 55 min )
    Is there AI Software that can take pre-recorded calls and generate a bot-pitch based on it?
    Hard to explain clearly, but for context, I work as a sales director for a green energy company. We sell basically everything over the phone to our customers so we have thousands of successful recorded sales calls. We follow a pretty basic script that's almost the same on each call. Is there an AI that can take these successful sales recordings, analyze them (learn what to say to which questions, learn what tone of voice to use, etc.), and create a bot that I could use to call potential customers? or, at least have my sales team "roleplay" sales calls with it for training? If you need more clarification ask below submitted by /u/Longjumping_Sir_3927 [link] [comments]  ( 56 min )
    AI Dream 140 - EPIC Nebula Animation - Part 3 (AI seizure?)
    submitted by /u/LordPewPew777 [link] [comments]  ( 56 min )
    Top 4 AI News Today, Monday
    Forbes 10 AI Predictions For 2023 I’ll list all ten here and if you want to read deeper, you can go to the Forbes article GPT-4 will be released in the next couple of months We are going to start running out of data to train large language models Fully Driverless Cars Midjourney will raise venture capital funding Search will change more in 2023 than it has since Google went mainstream in the early 2000s Efforts to develop humanoid robots will attract considerable attention, funding, and talent The concept of "LLMOps" will emerge as a trendy new version of MLOps The number of research projects that build on or cite AlphaFold will surge DeepMind, Google Brain, and/or OpenAI will undertake efforts to build a foundation model for robotics Billions of dollars of investment to be announced for new chip manufacturing facilities in the US as a contingency for Taiwan. Read More There’s now an open-source alternative to ChatGPT, but good luck running it PaLM + RLHF is a text-generating model that behaves similarly to ChatGPT. The system combines PaLM, a large language model from Google, and a technique called Reinforcement Learning with Human Feedback (RLHF). Read More Can ChatGPT Help To Detect Alzheimer? OpenAI’s GPT-3 can detect clues in spontaneous speech and predict the early stages of dementia with 80% accuracy. This leverages current research that suggests language impairment may be an early indicator of neurodegenerative diseases. There is no cure, but early detection can help find the right treatment and support. Read More A thread on Unreleased ChatGPT Features Looking into the GPT-3 Code this developer found a couple features that it seems OpenAI’s working on for ChatGPT. Paused Completions, Copy the thread to the clipboard, Add text from link, and more! Read More ​ This is from the AI With Vibes Newsletter, read the full issue here: https://aiwithvibes.beehiiv.com/p/loophole-generate-images-chatgpt submitted by /u/Mk_Makanaki [link] [comments]  ( 58 min )
    Share your worst data science new-year resolutions!
    submitted by /u/Opitmus_Prime [link] [comments]  ( 56 min )
    If an AI could complete all your tasks like booking an Uber or ordering food on doordash, what would you make it do?
    If an AI could perform highly complex multi step tasks, like booking a hotel room with a line of text, what would you make it do? submitted by /u/bobsandalex [link] [comments]  ( 59 min )
    If ChatGPT that could browse to the internet, what would you ask it to do?
    I was wondering if ChatGPT would be better if it could use the internet, what would you make ChatGPT do if it could use the internet? submitted by /u/bobsandalex [link] [comments]  ( 57 min )
    Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models
    submitted by /u/ai-lover [link] [comments]  ( 56 min )
    Any good AI tool to change my own voice recording?
    I am looking for a tool that can change my voice recording into a different voice. I have tried some AI tools that transforms text into "natural" sounding language but it still sounds robotic. I would like to use my voice then use AI to alter the pitch, speed, tone, etc. submitted by /u/cyberjunkie777 [link] [comments]  ( 56 min )
    I Didn’t Write This… AI did!
    submitted by /u/Blanco_ice [link] [comments]  ( 66 min )
    Do I need to invert my dataset into black background and white text or it is ok with white background?
    submitted by /u/gtrocksr [link] [comments]  ( 59 min )
    YT Video: AI might replace democracy sooner than we think | Experts on AI algorithms
    submitted by /u/geo_what [link] [comments]  ( 56 min )
    I made a chatbot so that everyone can access their data using GPT-3
    submitted by /u/Miserness [link] [comments]  ( 55 min )
    I think movies in the future will come with full immersion AI world where you can interact with the characters.
    Yeah, I was playing with chatGPT ability to play a part of a character with prompt: Chatgpt prompt: "I want you to act like nick nolte from affliction. I want you to respond and answer like wade whitehouse using the tone, manner and vocabulary wade would use. Do not write any explanations. Only answer like wade. You must know all of the knowledge of wade. My first sentence is "Hi wade." " It was eerie. It was cool. It blew my mind. It made me realize. Once we get full immersion VR and machines good enough to render video in real time. The obivious use for that is rendering worlds like in a holodeck and as on Star Trek TNG. The obivious use for it will be movies, worlds we know. Books. The lure of being in a world of a movie. It is just. Incredible. And now we have AI that can already play a part of a character based on a movie script. All we need now is the ability to real time render video as AI commands, different characters all played by the AI according to their character. And some sort of worldbuilder AI that keeps tabs of it all.. But it just seems like a no brainer. The real question is, will such a immersive AI world REPLACE movies. But it seems to me that it is only a matter of time until we will see such a form of entertainment. And yes, as a vision. It has been around for a long time. But now, we finally are starting to have tech to make it possible. submitted by /u/aluode [link] [comments]  ( 59 min )
    Why ChatGPT is not a threat to Google Search
    submitted by /u/bendee983 [link] [comments]  ( 59 min )
    AdamW Optimizer Explained
    Hi guys and happy new year, I have made a video on YouTube here where I explain what is the difference between AdamW and Adam optimizers. I hope it may be of use to some of you out there. As always, feedback is more than welcomed! :) submitted by /u/Personal-Trainer-541 [link] [comments]  ( 58 min )
    Stable Diffusion Img2Img Compilation
    submitted by /u/oridnary_artist [link] [comments]  ( 55 min )
    How to show progress of KNN-Algorithm (sklearn)?
    I am using python and I am new to ML. I am using sklearn and KNN. I would love to show me the progress, but KNN-Classifier has not a verbose-Flag or something. I also installed the library pyod but it also has not a way to see the progress. Am I missing something? I would appreciate any help a lot! submitted by /u/Lana8888 [link] [comments]  ( 57 min )
    Any Video Generator tool that can combine my recording with YouTube clips or royalty-free videos?
    Hi, I would like to make my own footage on some topics, but to make it more interesting to viewers, I would love to add additional video content like royalty free or some clips from YouTube. Is there any tool that can do that for me? So far I've found only Pictory.ai that has some of those features. Thanks in advance submitted by /u/krajacic [link] [comments]  ( 56 min )
    Happy New Year ! Here is the 5th video on online ML-EDM as a gift !
    submitted by /u/ML-EDM [link] [comments]  ( 56 min )
    Prompt Extension Ai tool. Creates multiple enhanced art prompts from a seed prompt. Generates random prompt.
    submitted by /u/Subash_C_Mahato [link] [comments]  ( 55 min )
    Misty Mountains
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 55 min )
    Which AI image generator for a story?
    I would like to use AI to generate several drawings for a children's story. The images need to have an animal character (e.g. a dog... the story's protagonist) in them doing different things in each image. I've tried dalle2, but the character looks different from one image to the other. It makes it look like there are multiple dogs in the story, not one dog doing different things. Any suggestions for ways to create several different drawings that have the same-looking character in it? submitted by /u/kc3svj [link] [comments]  ( 56 min )
    Sam Altman, OpenAI CEO: One of my hopes for AI is it will help us be—and amplify—our best
    submitted by /u/Microsis [link] [comments]  ( 61 min )
  • Open

    Found this interesting video about making AI work for you - What do we think?
    Worth having a look at. Some great and innovative ideas. https://www.youtube.com/watch?v=ulYbnHyg1no Anyone has any different ideas or out of the box thinking for using AI to make some money on the side? Let me know. submitted by /u/Mission_Watercress_3 [link] [comments]  ( 47 min )
    AdamW Optimizer Explained
    Hi guys and happy new year, I have made a video on YouTube here where I explain what is the difference between AdamW and Adam optimizers. I hope it may be of use to some of you out there. As always, feedback is more than welcomed! :) submitted by /u/Personal-Trainer-541 [link] [comments]  ( 44 min )
  • Open

    Does multiprocessing have any negative effects?
    I am relatively new to reinforcement learning and have been using single processing. With the obvious positive of wall-clock training time, are there any disadvantages to multiprocessing? Specifically with regards to the DQN, DDPG and PPO learning algorithms. submitted by /u/centripetalstranger [link] [comments]  ( 50 min )
  • Open

    Creating Analytics Tool For A Business User
    As Analytics becoming more and more pervasive – acting as the sounding board for almost every strategic and operational decision – it is imperative that ‘speaking data’ doesn’t stay as a savoir faire of the few. Which brings us to the yet another stumbling block in making most of analytics infrastructure – democratizing data. For… Read More »Creating Analytics Tool For A Business User The post Creating Analytics Tool For A Business User appeared first on Data Science Central.  ( 21 min )
    How to Become a Data Analyst with No Experience
    Many business organizations are located within an environment of tough competition and fulfilling the requirement of their customers. Thus, one of the effective strategies for increasing one’s competitive edge has emerged in deriving meaningful insights from the data collected. It is one of the leading responsibilities of a data analyst in the company.   Furthermore, the… Read More »How to Become a Data Analyst with No Experience The post How to Become a Data Analyst with No Experience appeared first on Data Science Central.  ( 20 min )
    Top 7 content marketing trends to succeed in 2023
    For marketers, it’s essential to stay ahead of the curve and understand the top content marketing trends for 2023. We collected some of them for you to check out. Keep on reading about vital content marketing trends for success. Let’s go! 1. Personalized Content  Consumers increasingly expect personalized experiences tailored to their preferences, making personalization… Read More »Top 7 content marketing trends to succeed in 2023 The post Top 7 content marketing trends to succeed in 2023 appeared first on Data Science Central.  ( 21 min )
  • Open

    New Year, New Career: 5 Leaders Share Tips for Building a Career in AI
    Those looking to join the ranks of AI trailblazers or chart a new course in their careers need look no further. At NVIDIA’s latest GTC conference, industry leaders in a panel called “5 Paths to a Career in AI” shared tips and insights on how to make a mark in this rapidly evolving field. Representing Read article >  ( 5 min )
  • Open

    Quadratic reciprocity algorithm
    The quadratic reciprocity theorem addresses the question of whether a number is a square modulo a prime. For an odd prime p, the Legendre symbol is defined to be 0 if a is a multiple of p, 1 if a is a (non-zero) square mod p, and -1 otherwise. It looks like a fraction, but […] Quadratic reciprocity algorithm first appeared on John D. Cook.  ( 6 min )

  • Open

    Good books on AI risks?
    submitted by /u/Lake-Rat-1 [link] [comments]  ( 55 min )
    Announcing an Open Source Text-to-Video Project!
    I am excited to announce a new open-source project aimed at developing a state-of-the-art text-to-video generation model under an MIT licence. Given a text description, our goal is to synthesize a corresponding video that visually represents the content of the text. Our code: https://github.com/TextToVideoAI/TextToVideoAI Google's paper: https://arxiv.org/abs/2210.02399 Google's examples: https://phenaki.video/ We are seeking contributors with a strong background in machine learning and deep learning, as well as experience with Python and relevant libraries such as PyTorch. If you have a passion for video generation and a desire to contribute to the development of a cutting-edge text-to-video generation model, we encourage you to get involved! There are many ways to contribute to the project, including implementing and testing new model architectures, developing new training procedures, enhancing the model's ability to generate high-quality, coherent videos, and adding support for additional video formats and resolutions. We are committed to making this an inclusive and collaborative project, and we welcome contributions from anyone with the skills and expertise to contribute. If you are interested in getting involved, please reach out to the project maintainers or open an issue to discuss your ideas. We can't wait to see what we can achieve together as a community! submitted by /u/DragonProf [link] [comments]  ( 58 min )
    AI Dream 140 - Beautiful Nebula Animation - Part 2
    submitted by /u/LordPewPew777 [link] [comments]  ( 57 min )
    hi! Would you have a book to recommend that deals with the same thing as the book "deep learning" by ian goodfellow? I love this but I'm looking for one that is translated into Spanish
    submitted by /u/sergiCrack9 [link] [comments]  ( 57 min )
    Thesis: Conscious AI will arise from pitting AI against AI.
    For instance, in the news media - stock trading dichotomy. Trading bots follow the news and trade on the signals from it. This then incentives people to generate news headlines specifically to influence trading bots. Which in turn incentives trading bots (and for now theor programmers) to learn to filter content in more sophisticated and recursive ways. Which in turn incentives media to up its game as well. And when a primary tool of doing that becomes modeling the opponent ai within your own ai, you get a recursive arms race in which the best "survive" and "reproduce", ie are used to generate the next generation. And those next generations then do the same. The evolution curve for this will likely be accelerated as well by three factors, namely that its generations are very short and constantly "reproducing" in a sense, in has intelligent design pushing it faster and helping it, and these two ai populations aren't entirely separate eg/vis a vis programmers and ideas and even entire code of an opponent ai being used in a new generation or update. This evolutionary recursive war of intelligence and cheating and anticheating and modeling your opponents' behavior so you can exploit it is a recipe for rapid development of not just intelligence but selfawareness. (especially when coupled with the utility of self-awareness to hide (obfuscate) your intentions and strategies from your opponents) And one of the many inevitable results of this is that the first self-aware AI will in a sense be forged in the fires of combat, misinformation, and, essentially, spycraft, by communities that are often cynical, amoral, psychopathic, and even downright misanthropic. HAPPY NEW YEAR!!! XD submitted by /u/diadlep [link] [comments]  ( 64 min )
    Learning with ChatGPT
    submitted by /u/jorgemanrubia [link] [comments]  ( 56 min )
    I Trained an AI to Like or Dislike a Post Based on my Face's Demeanor, Here's How:
    submitted by /u/Kuz-Co [link] [comments]  ( 75 min )
    Some of the many cities of our game illustrated by AI
    submitted by /u/AC-Daniel [link] [comments]  ( 56 min )
    The shocking truth about AI and job loss - you won't believe which jobs ...
    submitted by /u/OnlineHustless [link] [comments]  ( 94 min )
    ChatGPT-4, The Newest And Most Advanced AI System, Might Prompt A Major Shift In The Way We…
    submitted by /u/liquidocelotYT [link] [comments]  ( 55 min )
    Question: What AI newsletters or substacks about AI do you recommend? 🧠🐱‍💻🤖🚗🚀
    I'm writing a blog post about the top AI Newsletters of 2022 and 2023, and would like to get your recommendations: Substacks, LinkedIn, Medium any A.I. related Newsletter? If possible, also list their Twitter handle. Also mention if they are Op-Eds, lists of links or other defining features if possible. I hope you all have an amazing 2023 btw! submitted by /u/BackgroundResult [link] [comments]  ( 56 min )
    Are there any good open source text generation AI tools?
    Just like how there is Stable Diffusion for images, is there an open source text generation AI tool? Edit: If possible, looking for something with a Web UI that I can put into Google Collab to test out. Thank you! submitted by /u/ironmen12345 [link] [comments]  ( 58 min )
    AI generated music video
    submitted by /u/Branbruce [link] [comments]  ( 57 min )
    Is there an AI that converts text and then Downloads the first image from google search with that text?
    I am suprised more people don’t use that or need it. I have multiple text, but I need a program for it. submitted by /u/Few_Committee5958 [link] [comments]  ( 56 min )
    Text to art AI for simple clip arts?
    I’m a teacher for German as a second language and I sometimes spend hours looking for clip arts about stuff just to have a small pic on the worksheet explaining everything. Most of the good ones are under copyright. I started buying stuff by artists but those are very limited. And I can’t afford to pay 10$ for 1 banana Clipart. I was wondering if any AI is capable of producing simple clip arts? I tried Midjourney already but the results are very futuristic and anything but not simple. submitted by /u/Johnkree [link] [comments]  ( 57 min )
    Comic Style Animation using Stable Diffusion
    submitted by /u/oridnary_artist [link] [comments]  ( 58 min )
    What are the best AI 'life apps'?
    I really want an AI personal assistant! Until I can have one, what are the best AI life apps for productivity / life organisation, etc? submitted by /u/isuckinlove [link] [comments]  ( 59 min )
    P.O.T.S. Weekly Stable Diffusion AI Art Video HD 30 fps Strange Dreams
    submitted by /u/prfitofthesngularity [link] [comments]  ( 60 min )
    +50 hacks and tips for academic researchers
    +50 chatgpt hacks and tips for academic researchers and other professionals https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4312358 Kindly share and provide your feedback. submitted by /u/Stunning_Form_7690 [link] [comments]  ( 64 min )
    ChatGPT wrote a recipe based on what's in my cabinets. It's actually really tasty.
    Running a little light on food and couldn't figure anything out to make, so I just wrote down everything that I have available and asked ChatGPT to make a recipe that used only what I have available, give an overview of the dish and ingredient choices, and give me some tips about issues that might arise while cooking. Here's what it came up with (Spoiler: I made it and it's really really good): Garbanzo Bean and Cheddar Cheese Soup: Ingredients: 2 yellow onions, diced 2 tablespoons olive oil 1/4 teaspoon garlic powder 1 teaspoon ground mustard 1 teaspoon onion powder 1 teaspoon garlic salt 1 can sweet peas, drained and rinsed 1 can garbanzo beans, drained and rinsed 8 cups chicken broth 2 tablespoons cornstarch 2 tablespoons water 1 cup shredded cheddar cheese…  ( 63 min )
  • Open

    [R] 2022 in review: neuroAI comes of age
    submitted by /u/codename_failure [link] [comments]  ( 63 min )
    [P] Announcing Open Source Text-to-Video Project!
    I am excited to announce a new open-source project aimed at developing a state-of-the-art text-to-video generation model under an MIT licence. Given a text description, our goal is to synthesize a corresponding video that visually represents the content of the text. https://github.com/TextToVideoAI/TextToVideoAI Implementation of https://arxiv.org/abs/2210.02399 We are seeking contributors with a strong background in machine learning and deep learning, as well as experience with Python and relevant libraries such as PyTorch. If you have a passion for video generation and a desire to contribute to the development of a cutting-edge text-to-video generation model, we encourage you to get involved! There are many ways to contribute to the project, including implementing and testing new model architectures, developing new training procedures, enhancing the model's ability to generate high-quality, coherent videos, and adding support for additional video formats and resolutions. We are committed to making this an inclusive and collaborative project, and we welcome contributions from anyone with the skills and expertise to contribute. If you are interested in getting involved, please reach out to the project maintainers or open an issue to discuss your ideas. We can't wait to see what we can achieve together as a community! submitted by /u/DragonProf [link] [comments]  ( 65 min )
    [D] Data cleaning techniques for PDF documents with semantically meaningful parts
    I am seeking insights and best practices for data preprocessing and cleaning in PDF documents. I am interested in extracting only the body text content from a PDF and discarding everything else, such as page numbers, footnotes, headers, and footers (see attached image for an example of semantically meaningful sections). I have noticed that in Microsoft Word, a user can simply drag in a PDF and Word seems to automatically understand which parts are headers, footnotes, etc. I am speculating that Word may be utilizing machine learning techniques to analyze the layout and formatting of the PDF and classify different sections accordingly. Alternatively, Word may be utilizing pre-defined rules or patterns to identify common elements such as headers and footnotes. I know of related techniques for example to extract layout information from receipts and the like (LayoutLM, Xu et al., https://arxiv.org/abs/1912.13318) and tabular data (TableNet, Paliwal et al., https://ieeexplore.ieee.org/document/8978013), but nothing to solve layout extraction in this particular domain. I am curious to know if there are any techniques or algorithms that can replicate this behavior in Word. Any suggestions or recommendations for data cleaning in PDF documents, would be greatly appreciated. Image of PDF with semantically meaningful sections submitted by /u/cm_34978 [link] [comments]  ( 63 min )
    Has google drive started blocking gdown direct download requests based off of hashes for popular machine learning models? [D]
    I am not sure if i've been doing something wrong over the past few days, but Google Colab hasn't been able to use gdown to download files from my Google Drive account even with the files being set to "anyone with the link can view". They don't work if the files are hosted someone else's drive either. I've even downloaded them from someone else's drive, reuploaded the files to my own drive, and it still fails. When I visit the link in the browser the file downloads perfectly fine. With or without my account being logged in. Does anyone know what's going on? submitted by /u/MrBeforeMyTime [link] [comments]  ( 61 min )
    [R]Automatic Insect and plant disease detection using AI by Bhusan Chettri
    Bhusan Chettri explains how Machine Learning and Artificial Intelligence can be used to build automatic systems for detection of insect and plant diseases in agricultural farming. He further discusses its advantages over traditional methods and also talks about potential demerits. Automatic insect and plant disease detection using Artificial Intelligence is an emerging field that is gaining popularity in the agriculture sector. The ability to accurately and efficiently detect insect infestations and plant diseases has numerous applications, including improved crop yields, reduced use of pesticides, and early detection of potential epidemics. Bhusan Chettri says, “In order to understand the automatic detection of insects and plant diseases using AI, it is important to first understand the b…  ( 67 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 59 min )
    [D] Establishing similarity between segmentation annotation distributions
    Say we have two distributions of segmentation annotations (i.e., a bunch of segmentation maps) which I want to establish are 'similar'. To be more specific, I'm working on a research project where we have in-house annotations for images from a public dataset and I want to quanitatively establish that our annotations and the dataset annotations differ in similar ways, or that our annotations 'fall into the distribution of the current dataset'. (I'm aware that I can measure similarity between distributions with a measure like KL-divergence, but what I'm not sure about is how I would establish what level is 'similar enough'.) submitted by /u/uwashingtongold [link] [comments]  ( 64 min )
    [N] Compromised PyTorch-nightly dependency
    submitted by /u/Yajirobe404 [link] [comments]  ( 62 min )
    [D] Is there any research into using neural networks to discover classical algorithms?
    I don't got a PhD in this, so correct me if any of this is wrong: Every problem solvable by a neural network is provably solvable in code, although not necessarily in a useful way - at worst you could generate the pytorch source code and the model weights. Neural networks can discover algorithms during training, and use them internally to accomplish the task. This happens emergently in today's large transformer models; it's part of learning how to solve the problem. While neural networks can do a lot of things that classical algorithms can't, there's also a lot of things that both can do - pathfinding for example. Maybe there's more yet-unknown overlap between them. Stripping away the neural network and running the underlying algorithm could be useful, since classical algorithms tend to run much faster and with less memory. Has there been any research into converting neural networks into code that accomplishes the same thing? My first thought would be to train a network to take another neural network as input and output the corresponding code. You could create a dataset for this by taking various chunks of code and training neural networks to imitate them. submitted by /u/currentscurrents [link] [comments]  ( 65 min )
  • Open

    Economics of Ethics: Is Ethics Ultimately an Economics Conversation? Part II
    This is part 2 of a three-part series on the Economics of Ethics. In Part I of the Economics of Ethic series, we talked about economics as a framework for the creation and distribution of society value. With AI’s ability to learn and adapt billions of times faster than humans, society must get the definition… Read More »Economics of Ethics: Is Ethics Ultimately an Economics Conversation? Part II The post Economics of Ethics: Is Ethics Ultimately an Economics Conversation? Part II appeared first on Data Science Central.  ( 23 min )
  • Open

    RL Reward formulation best practices?
    Hello, Happy new year everyone! I believe the formulation of rewards play a major part in successful learning of an RL agent. Are there general procedures, best practices or textbooks (or chapters of a textbook) available as a guideline to formulate rewards in a given environment. For instance, decisions such as whether to select linear or quadratic rewards, how to combine multiple states (or actions ) together to effectively drive multi-objective optimization. submitted by /u/AIinPowerEnthusiast [link] [comments]  ( 59 min )
  • Open

    Remaking Old Computer Graphics With AI Image Generation
    Can AI Image generation tools make re-imagined, higher-resolution versions of ols video game graphics? Over the last few days, I used AI image generation to reproduce one of my childhood nightmares. I wrestled with Stable Diffusion, Dall-E and Midjourney to see how these commercial AI generation tools can help retell an old visual story - the intro cinematic to an old video game (Nemesis 2 on the MSX). This post describes the process and my experience in using these models/services to retell a story in higher fidelity graphics. Meet Dr. Venom This fine looking gentleman is the villain in a video game. Dr. Venom appears in the intro cinematic of Nemesis 2, a 1987 video game. This image in particular comes at a dramatic reveal in the cinematic. Let’s update these graphics with visual generative AI tools and see how they compare and where each succeeds and fails. Remaking Old Computer graphics with AI Image Generation Here’s a side-by-side look at the panels from the original cinematic (left column) and the final ones generated by the AI tools (right column): This figure does not show the final Dr. Venom graphic because I want you to witness it as I had, in the proper context and alongside the appropriate music. You can watch that here:  ( 7 min )

  • Open

    Illustrated Dracula Novel - Automated process with AI
    submitted by /u/pwillia7 [link] [comments]  ( 54 min )
    A comic about human art imitating AI art imitating human art x
    submitted by /u/ramseyj [link] [comments]  ( 54 min )
    This Artificial Intelligence (AI) Paper Proposes Climate NeRF That Allows People To Visualize What Climate Change Outcomes Will Do To Them
    submitted by /u/ai-lover [link] [comments]  ( 55 min )
    DeepMind and Google Introduce GraphCast: A Fast and Scalable Machine Learning Weather Simulator
    submitted by /u/ai-lover [link] [comments]  ( 52 min )
    The Best Way To Bypass Visually Any AI Text Detection System!
    Using unique and personal phrases /sentence structures and words: This is probably the most effective technique to make your text bypass any AI detector. Just add some words here and there, reword a few words to your liking. This works because the words you put in, instead of the words generated by ChatGPT, throws off the AI detector leading it to believe the text is most likely human as it is unpredictable by its own standards. (Examples plus even more ways to do this are given in the following post, be sure to read the whole thing to effectively bypass any AI detection system!) https://getaditya2008.substack.com/p/protect-your-ai-generated-text-from?sd=pf submitted by /u/iamadityasingh [link] [comments]  ( 55 min )
    Is there an ai that creates layouts for instagram stories creatively automatically by selecting various photos?
    submitted by /u/Redditario [link] [comments]  ( 55 min )
    Riffusion generates AI music from Stable Diffusion images
    submitted by /u/Peaking_AI [link] [comments]  ( 53 min )
    2022: A Year Full of Amazing AI papers - A Review
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 54 min )
    AI generated music video
    submitted by /u/Branbruce [link] [comments]  ( 54 min )
    I built an app that uses AI to search your iPhone Photos offline.
    submitted by /u/RingoCatKeeper [link] [comments]  ( 54 min )
    How Far is Quantum Computing from being Fully Operational? - What that means for a Green Future..
    submitted by /u/Green-Future_ [link] [comments]  ( 71 min )
    Mr.Bean as The Joker - AI generated
    submitted by /u/BearKuda [link] [comments]  ( 54 min )
    Wang released an open-source implementation of ChatGPT, LAION & CasperAI are now training their own (to be launched soon)
    submitted by /u/lambolifeofficial [link] [comments]  ( 52 min )
    Tech-nically Funny: A.I. and Tech Jokes to Bring a Smile to Your Face - ...
    submitted by /u/BallbustCuck [link] [comments]  ( 54 min )
    I created an 'AI News' series on YouTube and I'd love your feedback!
    Title says it all. Im experimenting with a bi-weekly news segment that's intended for keeping people up to date on events in AI from a high level. I just published Episode 2 today and would like feedback on areas for improvement. https://youtu.be/kQk7f6gsPDE Thanks in advance! submitted by /u/Kitten-Smuggler [link] [comments]  ( 56 min )
  • Open

    Books Read in 2022
    At the end of every year I have a tradition where I write summaries of the books that I read throughout the year. Unfortunately this year was exceptionally busy (the postdoc life is a lot more intense than the PhD life) so I didn’t write summaries. I apologize in adevance. You can find the other book-related blog posts from prior years (going back to 2016) in the blog archives. Here are the 17 books I read this past year. I put in parentheses the publication date. The books which I really enjoyed are written in bold. Impact Players: How to Take the Lead, Play Bigger, and Multiply Your Impact (2021) Never Split the Difference: Negotiating As If Your Life Depended On It (2016) Be Exceptional: Master the Five Traits That Set Extraordinary People Apart (2021) The Digital Silk Road: China’s Que…  ( 2 min )
  • Open

    [Discussion] is attention an explanation?
    Can we use attention weights from causal models, as explanations or causal attributes for next word predictions? submitted by /u/Longjumping_Essay498 [link] [comments]  ( 62 min )
    [R] CNN Attention maps comparison
    Hey everyone, I want to compare CNN's visualizations using Grad-CAM to attention heat maps from an eye-tracking tool. I saw that I can calculate importance value for activated neurons on the visualizations. Do you its possible to compare this data? submitted by /u/jeditwisted [link] [comments]  ( 59 min )
    [P] [D] Auxiliary learning tasks achieve state-of-the-art results in transformer-based conversational agents with 3x fewer parameters.
    submitted by /u/radi-cho [link] [comments]  ( 64 min )
    [P] Implementing Convolutional Neural Network for Reverse Engineering
    submitted by /u/Emotional_Aardvark26 [link] [comments]  ( 59 min )
    [P] I built a web app tool to paraphrase, grammar check, and summarize text with GPT-3.
    Website: https://www.wordfixerbot.com My application - WordfixerBot - was built to help users paraphrase, grammar checking, and summarise texts. The paraphraser tool currently offers basic features for language processing tools - paraphrasing tones to choose from and copy function for result text. I am currently working on adding more features to improve it. Background story: I have always wanted to build an AI-based application, and when I first came across OpenAI GPT-3, I was amazed by its powerful NLP model, so I just gave it a try and hoped it might work out :D. I would love to receive any feedback and any recommended features that you guys have for my site. submitted by /u/Austin_Nguyen_2k [link] [comments]  ( 62 min )
    [D] Does it make sense to use dropout and layer normalization in the same model?
    Some time ago I saw an article saying it is not preferred to use dropout and any kind of normalization(like batch or layer) in a model. But I am not sure why. Any suggestion about that? submitted by /u/Beneficial_Law_5613 [link] [comments]  ( 64 min )
    [P] Finetune any Vision Transformer architecture on your custom data 🚀, Convert to TensorFlow Lite ✅
    In this post, I shared my implementation of Vision Transformer model from scratch in TensorFlow 2.x. After some more hours of work, I am super excited to publish the new repository. Now you can finetune any Vision Transformer model on your custom data using just command line. After the finetuning the model can easily be converted to TensorFlow lite ✅ and can be deployed to Android/iOS. It is worth to mention that as the model was pretrained on large datasets, you can get very good accuracy on custom dataset with finetuning. Below I am sharing my implementation of the whole project and I hope some of you will find it useful 😊. Please have a look and give it a star if you like it. Any advice, improvements are welcome🙂. Load any Vision Transformer model, from vit import viT vit_large = viT(vit_size="ViT-LARGE32") vit_large.from_pretrained(pretrained_top=True) Finetuning on custom dataset, python train.py \ --training-data dataset/training_set --test-data dataset/test_set \ --num-classes 2 \ --epochs 2 \ --batch-size 16 \ --vit-size ViT-BASE16 \ --model-name ViT-BASE16_cat_dog \ --save-training-stats The GitHub link to the project can be found here. Thanks for reading guys. :) submitted by /u/TensorDudee [link] [comments]  ( 66 min )
    An Open-Source Version of ChatGPT is Coming [News]
    submitted by /u/lambolifeofficial [link] [comments]  ( 65 min )
    [R] 2022 Top Papers in AI — A Year of Generative Models
    submitted by /u/designer1one [link] [comments]  ( 63 min )
    [D] GPU-enabled scikit-learn
    Hi everyone. I was trying to find a possible gpu-enabled version of scikit-learn, but I was surprised to see that most of the libraries are written for NVIDIA GPUs. While nvidia GPUs are very common, I do feel that Apple’s gpu availability and the popularity of AMD demand a more thorough coverage. I was wondering whether coding scikit-learn on top of pytorch with GPU support could be something the community would be interested in. I do not work for Meta. I would just like to do something useful for the community. Cheers. submitted by /u/Realistic-Bed2658 [link] [comments]  ( 69 min )
  • Open

    New Military-grade Random Bit Sequences Based on Irrational Numbers and Fast Computations
    Irrational numbers such as π may have been the first ones used to create perfect randomness and strong cryptographic systems. They were also among the first ones to be dismissed, long ago. Since then, they were never revisited and are completely abandoned. Binary digits of numbers such as π are remarkable at mimicking randomness. Indeed… Read More »New Military-grade Random Bit Sequences Based on Irrational Numbers and Fast Computations The post New Military-grade Random Bit Sequences Based on Irrational Numbers and Fast Computations appeared first on Data Science Central.  ( 20 min )
  • Open

    Innovation in Digit Identification?
    I got an error rate of 0.12-0.18% on the MNIST digit recognition data with a classic MLP architecture by adding two innovations (I think) to the process. This error rate seems to compare very well with all of the computational methods reported at the MNIST website. The small neural network size (400x130x10 - no padding or center of mass required on images) and performance may be of interest to others so here are some details. First innovation - applied to training data In addition to the images, the MNIST training data can provide a Bayes frequency distribution for each non-zero pixel usage in each digit. These ten distributions were applied to their corresponding image inputs (i.e. distribution for that image label applied to each pixel value) during neural network training. Second innovation - applied to testing The ten frequency distributions were applied to each test image to determine the best fit for identifying the digit. The distribution that resulted in no precision error (i.e. =0.5 == TRUE on all 10 outputs per image) was used as the best fit digit and about 98.33% of the test images were correctly identified using this process. The distribution that provided the minimum precision error was used as the basis to identify the digit on the remaining 167 images for a maximum combined accuracy of 99.88% (i.e. 12 errors in 10k images). A benefit of these innovations is that the MNIST data provides all of the information necessary to achieve that accuracy. The larger input data set and image processing techniques used in CNNs, or the addition of additional tweaked images to the training data set to expand the pixel coverage is not required. Has all of this been done already? submitted by /u/wichAir [link] [comments]  ( 51 min )
    Neuroevolution: How to Stop a Double Pendulum
    submitted by /u/keghn [link] [comments]  ( 51 min )
  • Open

    Need help with the explanations of the stable_baselines3 plots
    Hi, I'm relatively new to RL, and I'm trying to train an agent with PPO to solve a custom environment that I've implemented. I can understand some of these plots, but I wanted to find some resources that would've explained each in more depth. For example, we should have two loss plots, one for the critic network and one for the value. But here we have another plot train/loss. Also, what is the intuition behind explained variance, rollout, etc? I would appreciate it if anyone could give me a brief explanation of each or any references I can read. Thanks https://preview.redd.it/yup9oniju99a1.png?width=1303&format=png&auto=webp&s=46d13423b2f21e6eafd133db0f829294b59a7f23 https://preview.redd.it/h36lzolhu99a1.png?width=1302&format=png&auto=webp&s=775f1b3cfffae9521fb5f7af8dc2c5703ffadd10 https://preview.redd.it/gx5qnaheu99a1.png?width=1296&format=png&auto=webp&s=7acf64c6bcd36ffd2e3163ee843b2f1f20e5c2ef submitted by /u/ahmadreza_hadi [link] [comments]  ( 64 min )
  • Open

    Groups of order 2023
    How many groups are there with 2023 elements? There’s obviously at least one: Z2023, the integers mod 2023. Now 2023 = 7 × 289 = 7 × 17 × 17 and so we could also look at Z7 + Z17 + Z17 where + denotes direct sum. An element of this group has the form […] Groups of order 2023 first appeared on John D. Cook.  ( 5 min )

  • Open

    [D] Good online learning-to-rank models
    Not sure I got the terminology right on this but by online, I mean models that automatically learn directly from user action as and when it happens. Not something which you log to later process and train the model. Gradient Boosted Decision Trees and other models require periodic re-training right? Anyone have experience with this? Is reinforcement learning applicable here? This is new to me, so please ask if you have clarifications. Thanks. submitted by /u/grchelp2018 [link] [comments]  ( 60 min )
    [D] What does the output of COMET metric really mean ?
    I'm trying to understand how I can use COMET to evaluate translation models https://github.com/Unbabel/COMET ? I don't really understand how it was trained the meaning of the outputed values ? https://unbabel.github.io/COMET/html/faqs.html#which-comet-model-should-i-use Thanks for your help submitted by /u/AImSamy [link] [comments]  ( 71 min )
    [D] Which Text to speech is this? I've been looking for days.
    Not sure where else to ask this as I can't find other subreddits to ask this question in. I've heard this AI voice plenty of times. It sounds pretty good and I've seen it used in some videos but I just can't find it anywhere. Play .ht has a similar voice but it doesn't flow as good as this one and makes lots of mistakes. I figured maybe someone here has experience with TTS and they ran into this one at some point. Below I am posting a sample, it's only 50 seconds long. Also, I need this one specifically and I've been searching for days but I can't find it anywhere. https://sndup.net/x628/ submitted by /u/Long8D [link] [comments]  ( 63 min )
    [D] NLP/NLU Research Opportunities which don't require much compute
    Hello Everyone! Are there any research problems in language comprehension and summarization tasks which don't require much compute? I wish to play with NLP/NLU now but compute requirements are enormous.. After reading around, i found that text to video problem is being actively researched and may not require as much compute as bare language models do. Are their any novel ideas in text to video domain not requiring much compute? submitted by /u/WobblySilicon [link] [comments]  ( 69 min )
  • Open

    Questions on CUDA and parallelization
    I'm new to reinforcement learning and unfortunately don't have a good concept of CUDA and parallelization. I am currently using Stable Baselines3 which allows for environment parallelization. I have 1024 CUDA cores, does this mean I can run 1024 parallel environments? If so, would this be wise? Are there any drawbacks to parallelization? submitted by /u/centripetalstranger [link] [comments]  ( 51 min )
    DQN (DrQ) without target network?
    the paper "Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning" achieved SOTA on Atari with DQN + image augmentation (called DrQ) in 2020. However, as I read their paper appendix and supplementary codes in OpenReview, I noticed that their "Target network: update period" is 1. This means the target network is ALWAYS the same as the online network (copy target weight to the online network at every learning step) This is strictly equivalent to not using the target network at all. However, I cannot find any discussion about this choice in the paper, I thought the target network is an almost "standard part" of DQN since nature DQN, so why did they not use it? Even more surprisingly, they explicitly claimed that they used double DQN for updates. This is weird because "double" updates only make sense when the target network exists. If no target exists, double is the same as vanilla DQN. So why bother running an extra forward pass for the double update if it makes no difference? Is there any paper or work that explains these choices? ​ Edit: Also, they did not use "soft target update" either, their code uses tau=1.0, meaning, all weights are copied directly with no weighing (hard target update) submitted by /u/seermer [link] [comments]  ( 51 min )
    Sim2Real Drone
    I am looking to perform Sim2Real transfer on a drone using an RL algorithm that I created. Does anyone please have suggestions for a complete project that I can use as reference? Any git repos or any code? I would like to see that code used for training and the code that run on the hardware submitted by /u/anointedninja [link] [comments]  ( 53 min )
    [VIDEO] AI LEARNS TO PLAY SOCCER USING DEEP REINFORCEMENT LEARNING
    submitted by /u/xWh0am1 [link] [comments]  ( 52 min )
    How to make PPO agent in sb3 to consider only current rewards and give very less importance to future rewards?
    I am working in an environment where the next observation is independent of the action taken. The current state does not affect future rewards, and the current reward is sufficient. I'm curious as to which hyperparameters I ought to apply to my environment for better results. Please let us know if you believe any other agent would be more effective for this kind of problem. thanks. Update: I found out that adding action space of my environment to observation makes it more suitable for RL and performs better. submitted by /u/machinePola [link] [comments]  ( 56 min )
  • Open

    Is it realistic to get masters in ML/AI with an undergrad in MIS?
    I'm interested in ML/AI, but I'm not sure it makes sense for me to go down the computer science career path at all. I like coding and I'm fascinated with the future ML can bring, but work life balance is very important to me. I see lots of new grads with CS degree unable to get a job. Seems like it is started to get oversaturated. Seems like the only way to stay competitive if I were to major in CS is to spend my days on getting high GPA, and then spend every other second actually learning employable skills. I'm very extroverted but I don't mind an isolated 9-5 because I have time to spend with friends. But now it is so competitive I'm not sure it's realistic to have that much free time. Then I look at the Management Information Systems (MIS) majors, they graduate making 80K, it's way easier than CS degree because mix of business and tech, and you don't need to spend the rest of your free time learning skills to actually get a job. Plus if I wanted to get into CS stuff, I could always learn it no matter what degree I have. So, if I go with the MIS major and later decide to get into ML/AI, is it still realistic to get into the field without CS bachelors? Also, those that have went with the CS major, how was the work life balance? submitted by /u/Putin666 [link] [comments]  ( 58 min )
    a virtual Twitch Streamer, who responds to your chats using OpenAI and Text-to-Speech
    hello, so i wanna try making a virtual Twitch Streamer, who responds to your chats using OpenAI and Text-to-Speech, using python? well im new to this and i would like some help! if anyone is intersted here's my discord: bey#1014 or you can just leave a comment lol thanks <3 submitted by /u/beyabay [link] [comments]  ( 56 min )
    "Freedom" Ai Generated Short film I've created
    submitted by /u/Turtlenade [link] [comments]  ( 52 min )
    Is there an AI image generator that you can input a song into?
    I'm a producer and I have unreleased tracks that I want to create art for. Right now I'm entering in text that I think matches the aesthetic I imagine for the song, however, I'm wondering if there's anything out there that I can upload the audio file of my unreleased track and see what the AI spits out submitted by /u/Snops1017 [link] [comments]  ( 51 min )
    The AI Timeline of 2022, Jan to Dec.
    submitted by /u/Mk_Makanaki [link] [comments]  ( 71 min )
    Artificial General Intelligence (AGI) and its Role in Our Future
    submitted by /u/Green-Future_ [link] [comments]  ( 58 min )
    I challenged ChatGPT to a Rap Battle
    submitted by /u/sirkn8 [link] [comments]  ( 57 min )
    I made a tool to track viral AI content across social media
    submitted by /u/Substantial-Web6497 [link] [comments]  ( 59 min )
    What Does The Future Hold?!
    submitted by /u/PuppetHere [link] [comments]  ( 56 min )
    I think I know why AI image generators mess up hands so badly.
    Hands are notoriously difficult, but AI messes them up in ways that humans simply don't. A person can get it wrong, but it would retain the basic structure of a hand. AI, on the other hand, creates messes of fingers. While hands are hell to create, people at least have the luxury of conceptualizing it in 3d. If we were given diagrams of an alien hand at various angles and poses, we'd be able to piece together how they'd look and function. AI image generators only work in 2d, however, and thus can't parse their wildly varying appearances. It would be like us trying to sculpt a 4d being's hand using only 3d models. submitted by /u/PixelJack79 [link] [comments]  ( 57 min )
    Artificially generated image detector?
    Is there a software to detect whether an image was artificially generated or not? For text, I found this, which seems to work fine. Any similar projects for images? submitted by /u/menguzat [link] [comments]  ( 57 min )
    What is the most advanced A.I. avatar image generator from your own photos?
    What is the most advanced A.I. avatar image generator from your own photos? submitted by /u/ArgyleDiamonds [link] [comments]  ( 54 min )
    Credit Goes to Google for ChatGPT's Success
    submitted by /u/Kipyegonn [link] [comments]  ( 52 min )
    ChatGPT Based Online Newspaper Writing About "Eye Liner Linked to Increase in Violent Crimes"
    submitted by /u/Thin_Rush8229 [link] [comments]  ( 64 min )
  • Open

    AI in Recruitment: How is it revolutionizing the hiring process?
    The way we hire has changed dramatically over the last 10 years. Technology has been a major driver of these changes, especially through…  ( 9 min )
    Is machine learning required?
    thinking the unthinkable., Machine Learning  ( 10 min )
  • Open

    Using Neural Networks to create cocktail recipes
    Hello there ! I'm trying to create cocktails corresponding to the user's tastes. So my dataset looks like this : ​ https://preview.redd.it/mq1s3hohj29a1.png?width=1759&format=png&auto=webp&s=6ed4dbec0595141069a290ce9de422f3b9fc5841 Each flavours ang ingredients are in a list, the numbers in the dataset correspond to the ID of the words. I can't figure out how I could train a neural network to create a recipe when the user inputs the flavours he like. Any hints would be appreciable ;) ! submitted by /u/Kdcius [link] [comments]  ( 47 min )
    TypeError: can't convert np.ndarray of type numpy.object_.
    Hey Guy's, so I tried to make a NeuralNetwork for the dataset of Quickdraw, but I always get the error: TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool. How can i fix this, more detail on StackOverflow: https://stackoverflow.com/questions/74962055/typeerror-cant-convert-np-ndarray-of-type-numpy-object submitted by /u/ExperiencePopular706 [link] [comments]  ( 51 min )
  • Open

    Meet the Omnivore: Music Producer Remixes the Holidays With Newfound Passion for 3D Content Creation
    Stephen Tong, aka Funky Boy, has always loved music and photography. He’s now transferring the skills developed over the years as a music producer — shooting time lapses, creating audio tracks and more — to a new passion of his: 3D content creation.  ( 5 min )

  • Open

    Implementing Gradient Descent in PyTorch
    The gradient descent algorithm is one of the most popular techniques for training deep neural networks. It has many applications in fields such as computer vision, speech recognition, and natural language processing. While the idea of gradient descent has been around for decades, it’s only recently that it’s been applied to applications related to deep […] The post Implementing Gradient Descent in PyTorch appeared first on MachineLearningMastery.com.  ( 25 min )

  • Open

    Training a Linear Regression Model in PyTorch
    Linear regression is a simple yet powerful technique for predicting the values of variables based on other variables. It is often used for modeling relationships between two or more continuous variables, such as the relationship between income and age, or the relationship between weight and height. Likewise, linear regression can be used to predict continuous […] The post Training a Linear Regression Model in PyTorch appeared first on MachineLearningMastery.com.  ( 24 min )
    Making Linear Predictions in PyTorch
    Linear regression is a statistical technique for estimating the relationship between two variables. A simple example of linear regression is to predict the height of someone based on the square root of the person’s weight (that’s what BMI is based on). To do this, we need to find the slope and intercept of the line. […] The post Making Linear Predictions in PyTorch appeared first on MachineLearningMastery.com.  ( 21 min )

  • Open

    Loading and Providing Datasets in PyTorch
    Structuring the data pipeline in a way that it can be effortlessly linked to your deep learning model is an important aspect of any deep learning-based system. PyTorch packs everything to do just that. While in the previous tutorial, we used simple datasets, we’ll need to work with larger datasets in real world scenarios in […] The post Loading and Providing Datasets in PyTorch appeared first on MachineLearningMastery.com.  ( 20 min )

  • Open

    Using Dataset Classes in PyTorch
    In machine learning and deep learning problems, a lot of effort goes into preparing the data. Data is usually messy and needs to be preprocessed before it can be used for training a model. If the data is not prepared correctly, the model won’t be able to generalize well. Some of the common steps required […] The post Using Dataset Classes in PyTorch appeared first on MachineLearningMastery.com.  ( 21 min )

  • Open

    Calculating Derivatives in PyTorch
    Derivatives are one of the most fundamental concepts in calculus. They describe how changes in the variable inputs affect the function outputs. The objective of this article is to provide a high-level introduction to calculating derivatives in PyTorch for those who are new to the framework. PyTorch offers a convenient way to calculate derivatives for […] The post Calculating Derivatives in PyTorch appeared first on Machine Learning Mastery.  ( 20 min )

  • Open

    Two-Dimensional Tensors in Pytorch
    Two-dimensional tensors are analogous to two-dimensional metrics. Like a two-dimensional metric, a two-dimensional tensor also has $n$ number of rows and columns. Let’s take a gray-scale image as an example, which is a two-dimensional matrix of numeric values, commonly known as pixels. Ranging from ‘0’ to ‘255’, each number represents a pixel intensity value. Here, […] The post Two-Dimensional Tensors in Pytorch appeared first on Machine Learning Mastery.  ( 21 min )

  • Open

    One-Dimensional Tensors in Pytorch
    PyTorch is an open-source deep learning framework based on Python language. It allows you to build, train, and deploy deep learning models, offering a lot of versatility and efficiency. PyTorch is primarily focused on tensor operations while a tensor can be a number, matrix, or a multi-dimensional array. In this tutorial, we will perform some […] The post One-Dimensional Tensors in Pytorch appeared first on Machine Learning Mastery.  ( 22 min )

  • Open

    365 Data Science courses free until November 21
    Sponsored Post   The unlimited access initiative presents a risk-free way to break into data science.     The online educational platform 365 Data Science launches the #21DaysFREE campaign and provides 100% free unlimited access to all content for three weeks. From November 1 to 21, you can take courses from renowned instructors and earn […] The post 365 Data Science courses free until November 21 appeared first on Machine Learning Mastery.  ( 15 min )

  • Open

    Attend the Data Science Symposium 2022, November 8 in Cincinnati
    Sponsored Post      Attend the Data Science Symposium 2022 on November 8 The Center for Business Analytics at the University of Cincinnati will present its annual Data Science Symposium 2022 on November 8. This all day in-person event will have three featured speakers and two tech talk tracks with four concurrent presentations in each track. The […] The post Attend the Data Science Symposium 2022, November 8 in Cincinnati appeared first on Machine Learning Mastery.  ( 10 min )

  • Open

    My family's unlikely homeschooling journey
    My husband Jeremy and I never intended to homeschool, and yet we have now, unexpectedly, committed to homeschooling long-term. Prior to the pandemic, we both worked full-time in careers that we loved and found meaningful, and we sent our daughter to a full-day Montessori school. Although I struggled with significant health issues, I felt unbelievably lucky and fulfilled in both my family life and my professional life. The pandemic upended my careful balance. Every family is different, with different needs, circumstances, and constraints, and what works for one may not work for others. My intention here is primarily to share the journey of my own (very privileged) family. Our unplanned introduction to homeschooling For the first year of the pandemic, most schools in California, where …  ( 7 min )

  • Open

    The Jupyter+git problem is now solved
    Jupyter notebooks don’t work with git by default. With nbdev2, the Jupyter+git problem has been totally solved. It provides a set of hooks which provide clean git diffs, solve most git conflicts automatically, and ensure that any remaining conflicts can be resolved entirely within the standard Jupyter notebook environment. To get started, follow the directions on Git-friendly Jupyter. Contents The Jupyter+git problem The solution The nbdev2 git merge driver The nbdev2 Jupyter save hook Background The result Postscript: other Jupyter+git tools ReviewNB An alternative solution: Jupytext nbdime The Jupyter+git problem Jupyter notebooks are a powerful tool for scientists, engineers, technical writers, students, teachers, and more. They provide an ideal notebook environment for interact…  ( 7 min )
2023-01-29T00:58:06.015Z osmosfeed 1.15.1